Blog Post

Voice Recognition Is Flying, Needs Focus

Back in September 2006, Nokia (NOK) CEO Olli-Pekka Kallasvuo declared that phones were no longer phones but multimedia computers that play back music, record videos, snap photos and, oh yes, make phone calls. Apple’s (AAPL) iPhone has only reinforced that notion. And as the phone morphs into a multimedia marvel, there is a growing realization that the traditional user interface of a phone, the 12-key keypad, may no longer be enough.

The keypad limits how much information we can input into these tiny devices, and acts as a speed bump when we’re trying to navigate through a complex array of features. That means we need new ways to interact with mobile devices. Apple, for one, has bet on the touch screen and the fluid UI.

And then there are those who believe that voice input is the way to go.

Microsoft (MSFT) bet about $900 million when it bought TellMe Networks. Some startups are voice believers as well, such as Cambridge, Mass.-based Vlingo Corp, which I’ve previously written about. Earlier this week, I got a chance to see a demo by Yap, a Charlotte, N.C.-based company with a similar approach — that is, taking voice and inputting it as text for everything from IM to navigation.

As with Vlingo, you need to download a Yap application to your mobile phone to get going, and then use voice to enter everything from instant messages to TwitterGrams. Yap also does its voice processing on the server side, and then sends information back over the mobile data channel.

There are others who are taking speech recognition even further — embedding it right into the chips that go into Bluetooth headsets. Cambridge Silicon Radio, a maker of Bluetooth chips, is now embedding speech recognition technology from Sunnyvale, Calif.-based Sensory into chips that will find their way into the Bluetooth headsets by the first quarter of 2008.

Sensory CEO Todd Mozer believes that everyone wants to do big things with speech recognition and mobiles, but in the end, the simple functions that enhance the hands-free experience are what make the most sense and are the most useful. Bluetooth devices make perfect sense as a starting point for voice commands. I agree with Mozer.

When I saw the Yap demo, I got the feeling that the application was trying to do too much; it needed some focus. After all, Yahoo (YHOO) and Google (GOOG) can take a similar server-centric voice recognition approach and provide a richer offering. Moreover, they can use their partnerships with large carriers to squeeze out little players such as Yap.

P.S.: All this interest in voice recognition and related technologies could explain the nice bump in the share price of Nuance (NUAN): up 64 percent for the year, despite a recent pullback that’s mostly because the company’s move into the mobile voice recognition arena isn’t sitting well with the Wall Street types.

Related Post: Sit Up and Listen, Future of Software.

18 Responses to “Voice Recognition Is Flying, Needs Focus”

  1. Jesse Kopelman


    But what if you don’t know what channel the Jetsons are on (or what submenu of your VOD service)? It will take a lot more than three keystrokes to get to them. On the other hand, voice interface coupled with background intelligence would be faster and easier than just a keypad. That’s the thing about voice command; it is not enough for the system to match your speech to the correct phrase, it also has to be able to execute a meaningful series of commands based on that phrase. In the end, VUI is just like GUI, it is the last two letters that are most important.

  2. Great post

    The general observations that I have made on the speech industry are…

    • Initially the speech technology vendors oversold the capability. There were very few good apps, mostly because the UI paradigm was new, and businesses picked the wrong applications given the readiness of the technology.

    • Soon the technology got better, predominantly because a) processors got faster and b) speech vendors had real user data for training, and providers had design and deployment experience. As a result, real performance results started to show up in the contact center for customer care applications. There was focus.

    Today I believe that we are back to where we were: new speech technology and solutions are emerging on mobile phones and edge devices. The prospects of this technology are very exciting. But we must learn from the past, and focus on the applications that have a high success rate given the readiness of the technology. Users don’t give new technology many chances, hence solution providers must get it right out of the gate in both accuracy and adoption rate. As an example, both Yap and Vlingo have a very appealing enabling technology, but it’s the success of the initial apps that will realize the ultimate potential of these companies (I am sure that what I am saying is not new to them).

    Lastly, some folks will like using speech user interfaces and others won’t; the speech industry is betting that it will find its niche among the competing UI paradigms. Personally, I believe that speech interfaces are not suited for everything or everyone, but given the right solution the value delivered could be great.

  3. One of the issues I remember from working with voice portals based on natural language recognition back in 1999 was the database problem.

    Every question should get some sort of response. (For us it was called the beep problem when there was no entry in the database).

    If the playing field is pretty limited, as with traffic reports, that isn’t a problem, but when you’re trying to roll out numerous services the number of responses starts to add up.

    I’m not in that field anymore, but what I remember was the number of people training the software so that it would be usable.

  4. I’ve always been a proponent of voice-commanded portable devices, if for no other reason than that a non-physical interface is optimal as the size of the device shrinks to (near) zero.

    Even though I have small fingers, it has seemed apparent to me that using the keypad on a mobile phone, for instance, is just clunky. There’s only so much variety in input you actually need from a mobile appliance, yet this variety is greater than what a 12-key keypad offers.

    The only other input that I think is worthy of a mobile device’s form factor is some kind of eyeball tracking, but we’re not there yet.

    I’m on the fence about the iPhone/iPod touch model, but I can’t get over the idea that a mobile appliance’s access should optimally be limited in scope.

  5. Om, sincerely appreciate your comments. The broad demo was intentional in order to showcase the flexibility of the platform (depending on the endpoint, we’ll have more or fewer options available to the user). Another thing to remember is that mobile consumers oftentimes do not want to switch contexts; if Yap can support 80% of what you’d normally need a mobile browser for, we’ll be successful…press a button, ask for something, boom, and then we get the heck out of your way.

    As for competition from the portals, if their strategic strength were as you portrayed it, the white label search providers (such as JumpTap and Medio) would not exist. You should hang out on the East coast more to reset your perceptions there. ;-) Our aspirations are well aligned with the carriers’ strategies, and we expect long and fruitful partnerships with them. Warm regards and thank you for your insightful questions at the event! i.

  6. Alexander van Elsas

    KPN, the Dutch telecom operator, has done many different experiments with voice recognition in live services. It is a difficult technology to make work for millions of people. Very simple tasks can be supported; for example, there is a traffic jam service that you can call, and it tells you if there is a traffic jam on the road that you name. There is also a directory service for getting phone numbers.
    Better results are obtained in security, though. Voice patterns can be used as a security measure in, for example, banking via phone. As a stand-alone technology it is still just not good enough, in my opinion.

  7. Voice recognition in phones is nothing new. I still remember the earliest Samsung cellphones, with the “Please say Name” (not a typo) voice dialing.

    While voice recognition has come far, it is still a very clunky way to deal with the system, and the novelty wears off after a while. Mistakes by the system are really irritating to deal with when you are trying to get something done quickly.

    On a lighter note, Jeremy Clarkson’s infamous tryst with the Mercedes S’s voice recognition should bring a chuckle. It points out why speech recognition still has its limitations as an input method.

  8. I think voice recognition is a good fit for some scenarios, say looking up numbers or dictating short text messages. Similarly, voice can be very useful for entering addresses and the like into maps.

    Pravin, the scenario here is for the mobile phones, not for home. I guess, at home the remote control always wins.

  9. Why would you want to talk all the time? Imagine coming back tired from the office and saying “bring up the Jetsons” versus clicking three buttons on your remote and bringing up the show.

    The mouse/remote/keypad wins hands down (pun intended).

    Sometimes we do not like to talk. HAL was irritating for that reason.

  10. I agree that voice is the next big thing, but it needs to come a lot farther than it has. I hate dealing with any customer service that has voice recognition; it just does not work well enough. Has anyone had any good experiences with it so far? Texting and IM are all fine and dandy, but let’s get some real-world applications set up.