Report: Speech Recognition and the Mobile Interface


One of the benefits of working for the GigaOM Network is exposure to a lot of great resources. The recently launched GigaOM Pro is a major source of information that touches on much of what we do. GigaOM Pro has published a technical report (subscription required) that delves into the impact speech technology can have on mobile uses like those performed on a phone. I have long been an advocate of speech recognition, and I believe it can play a part in creating a natural interface for working with computers. Speaking your mind takes on a special meaning when you do it to interact with a computer, and that interaction can pay off on the mobile phone, perhaps even more so than on speech-enabled computers.

The mobile phone is a personal device that has become ingrained in most everything we do in our lives. It is a device that is designed from the ground up to work with speech, and it makes great sense that a proper interface revolving around speech technology could be a huge benefit.

The report touches on current real-world applications that leverage speech to their advantage — GOOG411, for example. It makes sense to speak your queries when possible for the ultimate ease of use. These applications are possible thanks to more capable hardware and the cloud. Phones now have good processing power, and having the heavy lifting done by remote servers (the cloud) maximizes what can be accomplished with speech. Speech recognition takes a lot of processing power, and the report points out that local and remote resources are now sufficient to do a great deal.
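That division of labor — quick commands handled on the device, heavier dictation shipped off to remote servers — can be sketched roughly as follows. This is a hypothetical illustration only: the function names, the stub recognizers, and the 3-second cutoff are invented for the example, not taken from the report.

```python
# Hypothetical sketch of splitting speech recognition between device and cloud.
# The recognizers below are stand-ins, not real ASR engines.

def recognize_locally(audio_seconds: float) -> str:
    """Stand-in for a lightweight on-device recognizer (small command grammar)."""
    return "local result"

def recognize_in_cloud(audio_seconds: float) -> str:
    """Stand-in for a remote ASR service that handles open-ended dictation."""
    return "cloud result"

def recognize(audio_seconds: float, online: bool, command_cutoff: float = 3.0) -> str:
    """Route short command-length audio to the device; send longer dictation
    to the cloud. Fall back to the local recognizer when offline."""
    if not online or audio_seconds <= command_cutoff:
        return recognize_locally(audio_seconds)
    return recognize_in_cloud(audio_seconds)
```

The appeal of this pattern is that voice dialing still works in a tunnel, while open-ended queries get the accuracy of server-side models when a connection is available.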

I have used speech recognition for years, and the thought of a completely speech-enabled phone excites me. I would love to approach the “Star Trek” era by speaking commands to my phone and having it react appropriately. We do that on a restricted level currently, as in the case of voice dialing. That is speech recognition in its most basic form, and expanding that capability would only be better. As the report indicates, the next 12 to 24 months will see this ability spread much further toward total speech operation of our phones. Heck, my Bluetooth headset has speech recognition of its own that responds to basic commands.

Phones operated by speech will have to overcome a couple of barriers to adoption, in my opinion. My experience with speech recognition, and talking to others about it, has me convinced that many are embarrassed at the thought of operating a phone by voice in public. I’ve had many people tell me that’s why they don’t use headsets, either — they don’t like to be seen by others talking to their phone. Public places are often too noisy to allow accurate speech recognition, and this will have to be solved before widespread adoption can take place. The speech interface must work everywhere, no matter what, to gain widespread adoption. These barriers are just a result of human nature, and speech is a very human phenomenon, so they will have to be addressed.

The GigaOM Pro report is a good, comprehensive look at bringing a speech interface to the mobile phone. I recommend you take a long look at this report if you are a GigaOM Pro subscriber, and if not, maybe you should think hard about becoming one.


Arthur Bartlett

Wow, just like THEAPPLEBLOG you sold out. I feared this would happen when GIGAOM bought your site. It is becoming harder and harder to visit your site given GIGAOM’s control but old habits die hard. However, that may soon change. Advertising acceptable, subscription not! Please, no more teasers.

Joe Stafura

I’ve been involved with speech technology since 2000, helping to start a text-to-speech (TTS) company and consulting with an automatic speech recognition (ASR) company.

The most significant lesson learned was how hard it is to advance the performance of the core technology, as mentioned in an earlier post. The advances come from adapting the context in ways that improve performance.

In dictation programs this is a learning process: first your voice qualities, then your grammar, the words and phrases that you use a lot. For medical or legal transcription, limiting the domain improves performance in both TTS and ASR.

It appears the current approaches may get no closer to the holy grail: a recognizer that can understand almost anyone saying almost anything, almost anywhere.

For now, speech has an important role in hands-free applications like phones and cars, and the projected savings in medical transcription are large.


I have loved the idea of voice recognition; however, it has not advanced much since the AppleScripts I used to write on Apple OS 6 or 7 (I forget which), when the accuracy was pretty good. The current systems, with years of development behind them, have only marginally improved over the built-in system on those older Apple machines…


Personally, I really do not want speech recognition as an interface in most situations. Maybe at home or when I am in a private office; otherwise I just want a simple touch-type keyboard. I think the UMPC and portable marketplace has lost touch with how most people really want to use technology. That is probably why UMPCs have had such weak sales: for the most part they have no real keyboard, and only some are small enough to be more portable than a typical laptop.


I would think talking to a phone would be the least embarrassing way to use voice command. Someone talking *to* their phone looks the same as someone talking *on* their phone. The key would be to make the commands sound more natural. For example, instead of “Call Julie,” the command should be “Please call Julie.” Adding “please” makes it sound like you’re asking someone to do something, which is a completely normal thing to do over the phone. Throw a “hey” in front of that, and it’s a conversation you might hear in an elevator.
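The phrasing idea in this comment amounts to a tiny normalization step: accept the conversational wrappers ("hey," "please," "could you") and strip them down to the bare command the phone executes. A hypothetical sketch — the filler-word list is invented for the example:

```python
# Toy sketch: reduce polite, conversational phrasings to a bare command,
# so "Please call Julie" and "Call Julie" trigger the same action.
# The filler set is invented for this example.

FILLERS = {"hey", "please", "could", "you", "can", "would"}

def normalize_command(utterance: str) -> str:
    """Drop leading politeness/filler words and punctuation."""
    words = [w.strip(",.?!") for w in utterance.lower().split()]
    while words and words[0] in FILLERS:
        words.pop(0)
    return " ".join(words)
```

This keeps the user-facing interaction sounding like normal phone conversation while the device sees a fixed command.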

Comments are closed.