Siri dropped like a bomb on wireless consumers last fall, becoming not only one of the iPhone’s most popular features but also an immediate icon in popular culture and a constant talking point in industry circles. The voice-recognition technology has been the subject of numerous parodies and has even been unfairly blamed for skyrocketing mobile broadband consumption. But mostly Siri has been a big hit for Apple, helping make the iPhone 4S the top-selling smartphone in the U.S., and it has shone a huge spotlight on the possibilities for voice-driven user interfaces and natural language understanding in mobile.
So the question on every mobile developer’s mind must be, “How do I get in on the action?”
Working around Siri
A look at Apple’s patent applications shows that the company has big plans for Siri beyond the iPhone itself. The company likely plans to make the voice assistant a key part of its user interface technology across its product lines.
That means working with Apple would seem like the logical choice for developers to grab a piece of the action, but Apple, in typical fashion, has closed that avenue for now. Siri is still technically in beta, but Apple isn’t making Siri’s APIs available to developers. And there is no guarantee that will change once Siri leaves beta.
Apple has, however, opened Siri’s dictation functions to any application, allowing customers to use their voice as a stand-in for the soft keypad. For instance, you can dictate a tweet or Facebook update. Some particularly crafty developers have even leveraged those dictation features to create ad hoc voice recognition for their apps.
Grocery list app ZipList uses Siri-dictated SMS messages to compile shopping lists. Instead of using Siri to interact directly with ZipList’s app, customers tell Siri, “Tell ZipList to add pickles.” That generates a text message that is sent to ZipList’s servers. Task manager app Remember the Milk, meanwhile, doesn’t use Siri dictation. Rather, it taps into the iPhone’s calendar app, which is where Siri places “reminders” you give it. By syncing with the calendar, Remember the Milk generates to-do lists.
Of course, both those examples are work-arounds because of the lack of direct API access, and such implementations have some pretty big limitations. They not only require customers to configure settings and accounts on their phone but also strip out what is arguably the most useful part of Siri: its ability to understand the context of natural speech. Once Siri is done taking dictation, it is up to apps like the aforementioned two to figure out what the speaker meant. Sure, you can tell ZipList to add a “pound of butter” to your shopping list by literally entering “1 lb. of butter” into the app. But if you tell the app to remove “the butter” from that list, it has no idea what you are talking about.
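To see why dictation alone falls short, here is a minimal sketch of a dictation-only list like the one described above. All names are hypothetical (this is not ZipList’s code): the point is that items are stored as literal transcripts, so there is no entity the phrase “the butter” can resolve to.

```python
class DictationList:
    """A toy grocery list fed only by literal speech-to-text transcripts."""

    def __init__(self):
        self.items = []

    def add(self, dictated_text):
        # Dictation hands the app a literal string; it is stored as-is.
        self.items.append(dictated_text)

    def remove(self, dictated_text):
        # Removal only works on an exact string match -- there is no
        # semantic link between "the butter" and "1 lb. of butter".
        if dictated_text in self.items:
            self.items.remove(dictated_text)
            return True
        return False

shopping = DictationList()
shopping.add("1 lb. of butter")
print(shopping.remove("the butter"))       # False: no semantic understanding
print(shopping.remove("1 lb. of butter"))  # True: only the exact literal string works
```

Without a layer that maps phrases to entities, every command has to repeat the stored text verbatim, which is exactly the limitation the paragraph above describes.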
If developers want to approximate the usefulness of Siri, they need to integrate voice recognition and natural language understanding deeper into their apps, and today there are only a few avenues for doing so. Siri is closed off for now, as is Microsoft’s TellMe technology for Windows Phone 7 developers.
That leaves essentially two choices: Google and Nuance Communications. Let’s take them one at a time.
Putting the action in Voice Actions
Google had voice services out long before Apple: The company has included a voice-enabled “keyboard” in the OS stack in every release since Android 2.1, and it makes its server-based speech recognition and language databases available to all developers at no charge.
Developers can map speech input into particular functions of their apps with just a few lines of code, and then they can select a language database to help interpret what users actually say. For instance, if you are building an SMS or email dictation app, you would select the “free form” language model, which translates speech into text literally. For an app that is searching the Internet, the “Web search” model would be a better choice, since it would try to winnow a user’s speech down to short, search-like terms to return the most accurate results.
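The difference between the two models can be sketched conceptually. The functions below are illustrative stand-ins, not the Android API (the real selection happens on Google’s servers when the developer picks a model), and the stop-word list is invented for the example:

```python
# Hypothetical stop words a search-oriented model might discard.
STOP_WORDS = {"please", "find", "me", "a", "the", "for", "search"}

def free_form_model(transcript):
    # "Free form" aims for a literal transcription -- the right choice
    # for SMS or email dictation.
    return transcript

def web_search_model(transcript):
    # "Web search" winnows the utterance down to short, search-like terms.
    words = transcript.lower().rstrip("?.!").split()
    return " ".join(w for w in words if w not in STOP_WORDS)

utterance = "Please find me a pizza place"
print(free_form_model(utterance))   # 'Please find me a pizza place'
print(web_search_model(utterance))  # 'pizza place'
```

Same spoken input, two very different outputs, which is why picking the right model for the app’s purpose matters.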
Google has packaged a lot of these functions into an app it calls Voice Actions, which initiates many basic but useful functions on the device: getting navigation directions, playing music, dictating SMS and email, calling a contact, and, of course, Web search.
But as my colleague Kevin Tofel pointed out in his comparison between Voice Actions and Siri, Google’s voice-interface technology has big limitations.
Voice Actions requires you to memorize specific commands, such as “Send a text message to . . .” or “Play the song . . .” Siri, on the other hand, can understand — albeit within limits — conversational language and context. In short, as Kevin pointed out in his recent long view on the invisible interface, Siri has limited semantic understanding, but Google’s voice technology does not have any. That is why Siri can understand questions like “Will I need an umbrella today?” and “Where is my wife?”
Google has built its voice technology on a rules-based model, said Dave Grannan, the president and CEO of Vlingo, an independent speech recognition and language intent company that is being purchased by Nuance. After its servers translate speech into its component words, Google tries to plug them into predefined rules, sort of like simple “if-then” programming commands. If Google can’t match those words to a specific rule, it can’t do anything with the speech, Grannan said.
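The rules-based approach Grannan describes can be illustrated with a toy command matcher. The patterns below are invented for the example (this is not Google’s implementation): recognized words are checked against predefined rules, and anything that fits no rule simply falls through.

```python
import re

# Predefined command rules -- effectively "if the words fit this
# template, then run this action".
RULES = [
    (re.compile(r"^send a text message to (?P<contact>.+)$", re.I), "sms"),
    (re.compile(r"^play the song (?P<title>.+)$", re.I), "play_music"),
    (re.compile(r"^navigate to (?P<place>.+)$", re.I), "directions"),
]

def match_command(transcript):
    for pattern, action in RULES:
        m = pattern.match(transcript)
        if m:
            return action, m.groupdict()
    # No rule matched: the system can do nothing with the speech.
    return None

print(match_command("Play the song Yellow Submarine"))
# ('play_music', {'title': 'Yellow Submarine'})
print(match_command("Will I need an umbrella today?"))
# None -- conversational phrasing fits no predefined rule
```

The second query is the kind Siri can field and a purely rules-based system cannot: without a matching template, the recognized words carry no actionable meaning.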
That is not to say that Google isn’t capable of developing natural language understanding. In fact, it reportedly already has, in a mysterious project code-named Majel, and is only waiting for the best time to launch. And when it does, developers will likely have a powerful, open and — best of all — free resource available to them to power the voice features of their apps, Grannan said.
“It looks like Google is working hard on natural language understanding for its next release,” Grannan said. “They want to have a Siri-like experience for Android, but they also want to use it as a competitive differentiator. The typical Google approach to the market is to release something for free and by doing so eliminate some of the competition.”
The nuance of Nuance
Though not a huge company unto itself, Nuance is a giant in speech recognition. Its technology is integrated into phones, apps, cars, call centers, TVs and many other consumer appliances. Its core technology even helps power Siri, though Nuance won’t provide the specific details of its relationship with Apple. With its acquisition of primary independent competitor Vlingo, Nuance will become even more of a force.
Nuance also has one of the more robust developer programs, which makes sense considering its business model. Unlike Microsoft, Google and Apple, which use voice recognition to power their own services, Nuance depends on licensing its technology to others, though it has released a few consumer apps of its own, such as Dragon Dictation and its semantic search platform Dragon Go. Big retail brands like Amazon, Wal-Mart and Target have already integrated Nuance technology into their apps, as have more-general information search companies like Merriam-Webster and Ask.com. If you are able to talk to your car, there is a good chance you are talking to a Nuance server.
But a developer doesn’t have to have the resources of a Wal-Mart or a Ford to tap into Nuance’s technology, said Vlad Sejnoha, Nuance’s CTO. At Nuance’s most basic level, developers get free access to its speech recognition and speech-to-text conversion APIs (though there are limits as to how far a developer can scale an app without a paid license), providing similar functionality to Google’s technology. The big difference is that Nuance’s technology works across the iOS, Android and Windows Phone platforms, not just Android. If a developer is looking to add dictation capabilities to an app or have it understand simple memorized commands, this is the way to go.
In the upper paid tiers, however, Nuance starts layering on contextual understanding and artificial intelligence, allowing developers to create more-sophisticated voice interfaces on par with Siri. Rather than merely translating speech into a line of text or an actionable command, the app could tap into deeper language models and semantic algorithms that could infer a speaker’s intent as well as the literal meaning of his words.
Take the grocery list application example: At the most basic level, a developer could use Nuance technology to add and remove items from the list, but at a more sophisticated level, a consumer could actually have a conversation with his grocery app. The app could be made to understand quantities and measurements, tallying up duplicate items and converting partial quantities into sizes that are commonly sold. It could become aware of your past shopping patterns and preferences, questioning why you added salted butter to your list when you always buy unsalted. If the app is linked to a recipe database, you could tell it you want to make chicken Kiev tonight for 10 people and the app would not only pull down the ingredient lists for that recipe but also scale the quantities for the double portion you will have to make.
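One way to picture the jump from literal dictation to contextual understanding is slot extraction: instead of storing raw transcripts, the app pulls an intent plus structured slots (quantity, unit, item) out of each utterance, so duplicates can be tallied and “the butter” resolves to an entity. The sketch below is hypothetical, simplified code, not Nuance’s API:

```python
import re

# A crude intent-and-slots grammar: verb, optional quantity/unit, item.
PATTERN = re.compile(
    r"^(?P<verb>add|remove)\s+"
    r"(?:(?P<qty>\d+)\s+(?P<unit>lb|oz|pints?)\s+of\s+)?"
    r"(?:the\s+)?(?P<item>.+)$",
    re.I,
)

def interpret(utterance, groceries):
    m = PATTERN.match(utterance.strip().rstrip("."))
    if not m:
        return groceries
    item = m.group("item").lower()
    if m.group("verb").lower() == "add":
        qty = int(m.group("qty") or 1)
        # Tally duplicate items instead of appending a second literal entry.
        groceries[item] = groceries.get(item, 0) + qty
    else:
        # "remove the butter" resolves to the stored entity, not a string match.
        groceries.pop(item, None)
    return groceries

cart = {}
interpret("add 1 lb of butter", cart)
interpret("add 1 lb of butter", cart)  # tallied, not duplicated
print(cart)                            # {'butter': 2}
interpret("remove the butter", cart)
print(cart)                            # {}
```

A production system would go far beyond a regular expression, of course; the contextual tiers described above would also fold in purchase history, unit conversion and recipe data.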
As developers move up through the tiers, Nuance helps them build their apps to best make use of that contextual understanding, and at the highest tiers it will customize its language databases for a particular developer, giving specialized apps a specialized vocabulary.
“We would sit down with an app developer and say, ‘You have this universe of groceries. We need to capture all of the different ways you can talk about and understand that universe of different things,’” Sejnoha said. “We’re not just talking about words. We’re understanding concepts.”
Of course, that level of integration doesn’t come cheaply. At that tier, development fees are not set but negotiated depending on the complexity of the implementation, and developers typically pay a per-speech-transaction or per-device fee once the app goes commercial. Per-transaction costs start at seven-tenths of a cent, and per-device costs start at 19 cents but range upward depending on the volume and complexity of the speech function.
No matter how much developers integrate with Nuance currently or Google’s future technology, they won’t be able to emulate the Siri experience entirely. That is because Siri isn’t part of any single app but rather an overarching interface that integrates with Apple’s own apps and taps into third-party information sources on the Internet. Since Apple, Google, Microsoft and device makers control the user interface, they ultimately control what apps the personal voice assistant accesses and what services it uses. So unless Apple decides to allow developers access to Siri itself, developers’ speech implementations will be confined to their apps.
That is not necessarily a bad thing, though, Vlingo’s Grannan said. By being the top layer of an encompassing natural language engine, Siri’s understanding is by definition very general. While Apple has built some neat tricks into Siri, its understanding goes only so far, because there is context it just doesn’t have, Grannan said. Ask Siri, “Who won the game today?” and Siri won’t know which game you are talking about, Grannan said. But ask an ESPN app with natural language understanding, he said, and it will know exactly what you mean. The app already knows the teams you follow and the scores you track. For it, “the game” is either obvious or one of a few possible choices. So it can deliver to you all of those scores or ask which one of the few you meant.
“Siri is a great vision for the future, but it might be setting current expectations too high,” Grannan said. “Maybe we need to put up some guard rails.” When Apple launched Siri, it unveiled a grand vision for a new type of conversational interface that current technologies may not be able to live up to, Grannan said. But by shooting the moon, Apple also kick-started a nascent market. Consumers expect to converse with their phones soon, and the voice-recognition and natural language understanding sectors will have to deliver.