
Summary:

Are Google, Yelp and Facebook scared of Siri? If they aren’t, they should be, as should any mobile website, service or app that depends on advertising. Siri is the first user interface that shifts our attention away from our phones’ screens, but it won’t be the last.

Is Google scared of Siri? Is Yelp? Is Facebook? If they aren’t, they should be, as should any mobile website, service or app that depends on advertising for revenues. Siri is just the beginning of a new wave of user interfaces (UIs) that will gradually shift our attention away from our phones’ screens, allowing us to interact with our devices in ways that don’t involve tapping keys and staring at pixels.

These technologies will have a considerable impact on mobile advertising business models, which depend on consumers’ eyes being focused on their screens to work, said Vlad Sejnoha, CTO of Nuance, the voice recognition and artificial intelligence (AI) company behind Siri. Apple’s new voice assistant hasn’t exactly eliminated the need for visual interaction with a smartphone, Sejnoha said, but its effects are already being felt on search portals, which traditionally act as the middleman between consumers and the information they seek.

The most oft-cited example is “Siri, call me a cab.” Rather than perform the usual local search, displaying a list of taxi companies and keyword ads on a screen, Siri does all of the dirty work in the background, automatically placing a call to what its AI deems the most relevant dispatcher. That bypass of the search portal is changing the way cab companies present themselves on the Web, spawning a new type of SEO: ‘Siri’ engine optimization. Being in the top three listings of a local Google search or depending on AdWords to push your website to the top of sponsored results is no longer good enough if Siri is doing the searching instead of the consumer.
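
To make the bypass concrete, here’s a minimal sketch of that flow in Python. Everything in it (the toy intent parser, the stub business directory, the relevance scores) is a hypothetical illustration of the pattern, not Siri’s actual implementation.

```python
# Hypothetical stand-in for a local business directory.
LOCAL_DIRECTORY = [
    {"name": "Acme Taxi",  "phone": "555-0101", "relevance": 0.72},
    {"name": "City Cabs",  "phone": "555-0102", "relevance": 0.91},
    {"name": "Metro Limo", "phone": "555-0103", "relevance": 0.45},
]

def parse_intent(utterance: str) -> dict:
    """Toy keyword matcher; a real assistant uses statistical language models."""
    if "cab" in utterance or "taxi" in utterance:
        return {"action": "call", "service": "taxi"}
    return {"action": "search", "query": utterance}

def handle_utterance(utterance: str) -> str:
    intent = parse_intent(utterance)
    if intent["action"] == "call":
        # Act directly on the single most relevant result instead of
        # rendering a ranked listings page (and its ads).
        best = max(LOCAL_DIRECTORY, key=lambda biz: biz["relevance"])
        return f"Calling {best['name']} at {best['phone']}..."
    return f"Searching the web for {intent['query']!r}..."

print(handle_utterance("Siri, call me a cab"))  # Calling City Cabs at 555-0102...
```

The last branch is the whole story: the assistant acts on one result, and the listings page where the ad impressions lived never gets drawn.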

Nuance’s own Dragon Go iOS app makes a similar end run around the search portal. Dragon Go ingests a spoken search term such as “new Mexican restaurants” or “show times for Sherlock Holmes” and spits out websites displayed in a carousel, which the user can then thumb through. The websites themselves are still served up, and so are their ad impressions. But the portal middleman is eliminated, at least from the consumer’s perspective – Google’s engine may be used to power that search, but its customer-facing website or app never enters the equation.

“Speech and natural-language search adds a new element to the UI,” Sejnoha said. “It can skip over steps you’d usually take to get information. It’s a shortcut between a user’s intent and a positive outcome. That’s a powerful concept.”

You ain’t seen, heard or felt nuthin’ yet

While voice assistants like Siri may seem revolutionary today, they’re actually rather simplistic compared to what UIs will be capable of as artificial intelligence and new multimodal means of interaction develop. Phones are already full of sensors that could be used to input information, Sejnoha said. If you don’t like the answer your phone is giving you, a simple shake of the handset could tell it to find you a better one. Rather than speak or type a search term into a user interface, you could point your phone’s camera at a bus stop sign, a restaurant or even a magazine ad, and within seconds have the relevant bus schedule, menu or product information displayed on your screen or dictated back to you.

The phone’s accelerometers could be used to power gesture-based commands — a flick to the right means you want to make a phone call, while a flick to the left initiates a Web search. Micro-projectors could even move information off the screen entirely. For example, instead of displaying a tiny map through LEDs and glass, the phone could paint an arrow on the ground showing you the direction you need to head.
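
As a rough illustration of the flick idea, the sketch below classifies a gesture from a short window of accelerometer samples. The threshold, the sample format and the command mapping are all invented for the example; a real handset would read its platform’s motion APIs.

```python
# Classify a flick gesture from x-axis accelerometer samples (in g).
# Threshold and command mapping are invented for illustration.
FLICK_THRESHOLD = 1.5  # force that counts as a deliberate flick, not jitter

def classify_flick(x_samples: list[float]) -> str | None:
    """Return 'call' for a right flick, 'web_search' for a left flick."""
    peak = max(x_samples, key=abs)   # strongest acceleration in the window
    if abs(peak) < FLICK_THRESHOLD:
        return None                  # ordinary hand movement, not a command
    return "call" if peak > 0 else "web_search"

print(classify_flick([0.1, 0.4, 2.1, -0.8, 0.2]))  # call (hard right flick)
print(classify_flick([0.0, -0.3, -1.9, 0.5]))      # web_search (left flick)
print(classify_flick([0.05, 0.1, -0.07]))          # None (just jitter)
```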

As natural language interpretation and artificial intelligence technologies improve, phones will begin to comprehend subtleties in meaning and interpret intent, Sejnoha said. Nuance is working with IBM’s Watson program on deep question understanding and natural language processing. The hope for both companies is to create sophisticated network-based artificial intelligence that can power semantic searches.

Say you’re a hungry pedestrian walking the streets of New York and suddenly spot a new restaurant. You make a gesture with your device to activate its voice assistant, point it at the restaurant and ask: “What rating did the New York Times give this joint?” GPS knows your precise location. The digital compass knows which direction you’re pointing. The UI then pings its AI servers in the cloud, which determine not only that you’re interested in a restaurant, but that you’re specifically requesting the number of stars awarded it by a specific publication. The AI then scrapes that data from the NYT’s website and returns a simple audio answer: “Three stars.”
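
Schematically, that lookup chain could be as simple as the sketch below, where the GPS fix plus the compass heading resolve which restaurant you mean and the parsed question determines what to fetch. The place table, the ratings table and the function names are invented stand-ins for the cloud services described above.

```python
# Invented stand-ins for the cloud services in the scenario above.
PLACES = {   # (lat, lon, compass heading) -> the business you're pointing at
    (40.7216, -73.9954, "NE"): "Joe's Bistro",
}
RATINGS = {  # (publication, business) -> published rating
    ("New York Times", "Joe's Bistro"): "Three stars",
}

def resolve_place(lat: float, lon: float, heading: str) -> str:
    """Combine a GPS fix and a compass heading into a single business."""
    return PLACES[(round(lat, 4), round(lon, 4), heading)]

def rating_answer(publication: str, lat: float, lon: float, heading: str) -> str:
    place = resolve_place(lat, lon, heading)
    return RATINGS[(publication, place)]  # the bit that gets read aloud

# "What rating did the New York Times give this joint?"
print(rating_answer("New York Times", 40.7216, -73.9954, "NE"))  # Three stars
```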

In that scenario, the UI is already skipping over several websites and applications that a normal search would have used to find that 3-star rating: mapping apps, search engines and the New York Times website.  But what if that phone delivered more than just the basic data point requested? What if it correctly inferred you were looking to nosh and wanted help selecting a restaurant? It could search out other restaurant reviews from sites it knows you trust. It could scour the restaurant’s menu for dietary information. And it could poll your social networks to see if any of your friends had eaten there. The AI then contextualizes all of that info, decides what’s most relevant to you based on preferences and profile information, and spits out this answer:

Three stars, but Yelp gives it three and one half. Bob loves it. He says try the smoked-tea duck. Tell the waiter about your peanut allergy, though. The kitchen stir-fries with groundnut oil.

With that one semantic search, your device captures information from dozens of websites, but by performing that search with a multimodal interface, at no point do you have to look at the phone’s screen. That also means at no point do you have to look at an ad. “Along the way value is derived from those brief stays on the portals through ad impressions,” Sejnoha said. “If you’re getting from A to B in fewer steps, you’re bypassing those tolls.”
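
A toy version of that contextualizing step might look like the following: gather facts from several sources, keep the ones a user profile marks as relevant, push safety alerts to the end for emphasis, and join the rest into one spoken answer. The sources, profile fields and ordering rule are all invented for illustration.

```python
def contextualize(facts: list[dict], profile: dict) -> str:
    """Filter facts against a user profile and compose one spoken answer."""
    relevant = [f for f in facts
                if f["topic"] in profile["interests"] or f["topic"] in profile["alerts"]]
    # Stable sort: ordinary facts first, safety alerts last for emphasis.
    relevant.sort(key=lambda f: f["topic"] in profile["alerts"])
    return " ".join(f["text"] for f in relevant)

profile = {"interests": {"reviews", "friends"}, "alerts": {"peanut allergy"}}
facts = [
    {"topic": "parking",        "text": "Street parking is scarce after 6 p.m."},
    {"topic": "reviews",        "text": "Three stars, but Yelp gives it three and a half."},
    {"topic": "friends",        "text": "Bob loves it; he says try the smoked-tea duck."},
    {"topic": "peanut allergy", "text": "Mention your peanut allergy: the kitchen uses groundnut oil."},
]
print(contextualize(facts, profile))
# Reviews and Bob's tip come first, the allergy warning last; parking is
# dropped because nothing in the profile asked for it.
```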

What’s a Google to do?

These new interfaces put a lot of power into the hands of the UI creator, making them the new gatekeepers for information on the Web. That’s a huge benefit for big web brands that can strike up partnerships with the interface developer. For instance, Nuance works with Milo to deliver online-shopping results and Fandango for movie show times on its Dragon Go app. But since the focus of such interfaces is to deliver a single result rather than a variety of results, Internet companies without the desire or resources to partner with Apple or Nuance may find themselves marginalized.

Less choice may be exactly what consumers want, especially in the mobile world, where people are using their phones to grab snippets of timely information rather than conduct extensive research. According to Technology Business Research senior analyst Ezra Gottheil, implementations like Siri fix what many consider a critical flaw in the user interface: the huge number of options a user is faced with on the average smartphone. By simplifying the UI with voice commands that yield immediate access to results, such interfaces are likely to make consumers use searches and applications more frequently, Gottheil said in a TBR research note.

As for the search portals, if Google and Bing don’t want to get cut out of the value chain, the easy answer is to implement multimodal search interfaces of their own. Both engines make powerful use of artificial intelligence to refine their searches, and Google and Microsoft already accept voice-prompted search terms. It wouldn’t be hard for them to project an ‘optimal’ search result beyond the confines of the screen – a voice-activated, voice-delivered version of Google’s “I’m feeling lucky” button.
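
In code, a voice-driven “I’m feeling lucky” loop is almost trivially small, which is part of the threat. The index entries and the speak() stub below are placeholders for a real engine’s ranking and text-to-speech stack, not anything Google or Bing ships.

```python
INDEX = {  # placeholder for a real ranked search index
    "showtimes sherlock holmes": "Sherlock Holmes plays at 7:15 and 9:45 near you.",
    "weather new york": "Partly cloudy with a high of 41.",
}

def feeling_lucky(transcribed_query: str) -> str:
    """Return only the single best hit, not a ranked, ad-supported page."""
    return INDEX.get(transcribed_query.lower(), "I couldn't find a confident answer.")

def speak(text: str) -> None:
    print(f"[TTS] {text}")  # stand-in for a speech synthesizer

speak(feeling_lucky("Showtimes Sherlock Holmes"))  # [TTS] Sherlock Holmes plays...
```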

But if Google and Bing are delivering direct results rather than a list of links, what of their keyword-advertising business models? And what of the websites themselves, which find their information hijacked by a multimodal UI without even an ad impression in compensation? Sejnoha thinks they just need to be more creative with their business models.

The screen is never going away, Sejnoha said. There are certain types of information that people will always need to absorb visually. It’s much easier to look at a pie chart than to have it described to you. And unless you’re driving, it’s easier to read a restaurant review than to have your phone dictate it to you. But in the cases where information is more handily delivered off screen, companies need to find ways to draw their customers’ attention back to their devices, Sejnoha said.

Imagine in the restaurant example above that after your phone delivered its personalized and contextualized information about the restaurant in question, it added the following addendum: “By the way, I have some coupons for other restaurants in the neighborhood, if you’d care to look at my screen…”

Image courtesy Flickr user Lazurite

  1. Rorison Meadows Monday, December 19, 2011

    Do you forget that talking to your phone in public makes you an absolute douche?

    1. Who says it does? You?

    2. Appletards are well known as doucher than most people. Siri is another fail like Ping

      1. You wish.

  2. “The AI then contextualizes” ….
    Ever heard of Data finds Data, or context is organized data. No AI needed. For finding priorities, ehmm how about how any brain solves that, speed differential in data connections. Suddenly your system becomes sub-consciousness in its response, it’s also faster than the pixel sorters ever will be. Again no AI. As for Watson, isn’t it just another way to break down data isolation by parallelization and integration? If the DBs were created without it, no Watson necessary …

    Just saying.

  3. “At no point do you have to look at an ad.” But you might have to *listen* to an ad. It happens all the time on the radio.

    1. Yep, noticed that with Pandora a while ago, when I was surprised to hear a commercial. Then I wondered what took them so long. It’s not if, but when.

      1. Switch to Sirius then. Ad-free.

    2. Good point, Rich. I thought about mentioning Pandora and Spotify as examples of how ads are evolving, but I don’t think they really apply directly. In both those cases they’re the service provider and they’re shipping the audio ads. In the case of Siri or another UI, they’re scraping the Web for info and have no obligation, nor interest, in serving up someone else’s ads. Of course, there could be partnerships for this kind of thing. But the first time Siri starts speaking ads at you, you’re going to get a lot of upset iPhone users.

  4. “That popular search engine has such messy results, use Bing, the decision engine!”

    I wonder how that’s going.

  5. Android voice technology as it stands: when I’m in my car and I want to activate my phone’s voice technology to complete a text message, search my browser, or get directions, I first have to roll up all my windows, turn off the radio, heater, and windshield wipers, and tell every passenger in my car to shut up (all sources of noise interference). Explain how you do this in the middle of downtown New York. Now, what if I want privacy inside my car?? I would also have to tell all my passengers to plug their ears. And how do I guarantee that they did? When you figure these issues out, I’ll buy into your program.

    1. That’s a good point, Dis. If voice interfaces are really going to thrive, phones will need better microphones that can isolate a speaker’s voice from background noise. I wonder if Apple made any improvements to the mic tech in the 4S?

  6. It’s fun to ponder, but the level of personal involvement needed to make “one answer” consistently the best one for each person, place and time is enormous.

    Siri is often cute, and semi-consistently useful, but the data and personalization effort for the scenarios you describe feels like a long road full of potholes.

    1. Only if you believe journalists or people with no taste in math.
      Take a look at:
      http://gigaom.com/2011/12/20/nuance-buys-vlingo-builds-a-voice-technology-giant/#comment-779734

      How can we read the last sentence mostly as fast as unscrambled, can a machine do it without a large statistical data model?

      How about paralleling (TTL) then use differential timing to resolve sequence within letters, mostly as fast as unscrambled, works with any language and works on a phone/tablet. Can be used to read captchas or teach a machine math. and some more things, ok besides faster decryption of unknown encryption. Learning can be as personal as it gets without or because of no big data.

    2. Hi Perry. Yep, lots of potholes. But I also think the capabilities of Siri would have been unimaginable a few years ago, even if they aren’t all that spectacular (at least in a mass-market consumer application). Still, I would agree with you entirely if I didn’t think the majority of what consumers want to do could be boiled down to a few simple concepts: I want to eat, I want to go here, I want to tweet this. In my opinion, those all seem doable.

  7. Richard Garrett Tuesday, December 20, 2011

    Siri, Nuance, Vlingo — they all seem to point to a ‘dumber’ smartphone. Could it be that we’ll go back to StarTacs? I can see a time when that form factor coupled with a 7″ tablet is the mobile warrior’s arsenal of choice.

  8. And a five minute battery to go along with it.

  9. Siri is always ugly, and always useless as more and more people use devices other than Apple’s crappy trinkets

