Google explains how more data means better speech recognition


A new research paper out of Google describes in some detail the data science behind the company’s speech recognition applications, such as voice search and adding captions or tags to YouTube videos. And although the math might be beyond most people’s grasp, the concepts are not. The paper underscores why everyone is so excited about the prospect of “big data” and also how important it is to choose the right data set for the right job.

Google has always been a fan of the idea that more data is better, as exemplified by Research Director Peter Norvig’s stance that, generally speaking, more data trumps better algorithms (see, e.g., his 2009 paper titled “The Unreasonable Effectiveness of Data”). Although some hair-splitting does occur about the relative value (or lack thereof) of algorithms in Norvig’s assessment, it’s pretty much an accepted truth at this point and drives much of the discussion around big data. The more data your models have from which to learn, the more accurate they become, even if they weren’t cutting-edge stuff to begin with.

No surprise, then, that more data is also better for training speech-recognition systems. The researchers found that larger data sets and larger language models (here’s a Wikipedia explanation of the n-gram type involved in Google’s research) result in fewer errors predicting the next word based on the words that precede it. Discussing the research in a blog post on Wednesday, Google research scientist Ciprian Chelba gives the example that a good model will attribute a higher probability to “pizza” as the next word than to “granola” if the previous two words were “New York.” When it comes to voice search, his team found that “increasing the model size by two orders of magnitude reduces the [word error rate] by 10% relative.”
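To make the “pizza” versus “granola” example concrete, here is a minimal sketch of a trigram (3-gram) model in Python. It is only an illustration: the toy corpus is invented, the function names are hypothetical, and the add-alpha smoothing is an assumption chosen for brevity, not the method used in Google’s paper.

```python
from collections import Counter, defaultdict


def train_trigram_counts(sentences):
    """Count how often each word follows each two-word context."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            counts[(w1, w2)][w3] += 1
    return counts


def next_word_probability(counts, context, word, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(word | two preceding words)."""
    following = counts[tuple(w.lower() for w in context)]
    total = sum(following.values())
    return (following[word.lower()] + alpha) / (total + alpha * vocab_size)


# Toy corpus, invented for illustration only.
corpus = [
    "i want new york pizza tonight",
    "the best new york pizza is thin",
    "new york pizza beats the rest",
    "she bought new york granola once",
]

counts = train_trigram_counts(corpus)
vocab = {w for s in corpus for w in s.lower().split()}

p_pizza = next_word_probability(counts, ("New", "York"), "pizza", len(vocab))
p_granola = next_word_probability(counts, ("New", "York"), "granola", len(vocab))

# The model prefers "pizza" because it followed "new york" more often.
print(p_pizza > p_granola)
```

With only four sentences the estimates are crude. The point of the research is that feeding the same kind of counting billions of words, and keeping more of the resulting n-grams in the model, makes these probabilities far more reliable.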

The real key, however — as any data scientist will tell you — is knowing what type of data is best to train your models, whatever they are. For the voice search tests, the Google researchers used 230 billion words that came from “a random sample of anonymized queries from that did not trigger spelling correction.” However, because people speak and write prose differently than they type searches, the YouTube models were fed data from transcriptions of news broadcasts and large web crawls.

“As far as language modeling is concerned, the variety of topics and speaking styles makes a language model built from a web crawl a very attractive choice,” they write.

This research isn’t necessarily groundbreaking, but it helps drive home why topics such as big data and data science get so much attention these days. As consumers demand ever smarter applications and more frictionless user experiences, every last piece of data and every decision about how to analyze it matters.




Marek’s rudeness aside, I agree that the article was lite fare.
Opinion: I believe our young people (and some fuddies) are swallowing the tech pill and the Big Brother pill whole. I’m tiring of the re-inventing of the telephone, the car and the train. And Google’s and Facebook’s collection of personal data is scary.

Derrick Harris

Privacy concerns aside, this post really just tries to demonstrate the principle that having more data helps enable these advances in analysis.

Even if the models have been around for a while, the fact that Google can pull billions of words and phrases from its own data sets — and actually choose the most meaningful data set for any given situation — is significant to understanding how it does speech recognition.


Sometimes old techniques are forgotten, and all for naught.

A simple use case that these systems still seem to fail is looking up a contact phone number by voice.

Soundex was a simple algorithm for finding matches used by 411 services. It works especially well when doing speech recognition against a limited domain to overcome limitations in determining what word is said in a critical part of a search.

I have had so many “fails” when I try to say “call XXX” and the voice recognition can’t find XXX in my contacts list. Names are an especially hard voice recognition problem because of the variety of pronunciations.

Yet it is evident to me that these names could easily have been matched (and were obviously the only match) using Soundex, or a variation of it. If there is more than one match, just give a list of choices instead of “can’t find”.
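The Soundex idea described in this comment can be sketched in a few lines of Python. This is a minimal illustration of the standard American Soundex rules, not what any phone or 411 service actually runs; the `find_contacts` helper and the contact names are hypothetical.

```python
# Map each consonant to its Soundex digit; vowels, h, w and y have no code.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name):
    """Return the four-character Soundex code for a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def find_contacts(heard_name, contacts):
    """Return every contact with a name part that sounds like heard_name."""
    target = soundex(heard_name)
    return [
        contact
        for contact in contacts
        if any(soundex(part) == target for part in contact.split())
    ]


# Hypothetical contacts list.
contacts = ["Catherine Smith", "Kathryn Jones", "Robert Lee", "Rupert Vance"]

# Both names share a Soundex code, so both are offered as choices.
print(find_contacts("Rupert", contacts))
```

Returning a list rather than a single hit matches the suggestion above: when several contacts share a code, the caller can show the choices instead of reporting no match. Note that Soundex keeps the first letter, so “Catherine” and “Kathryn” would still not match each other; a variation of the algorithm would be needed for that case.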

Really big data is nice, but context and use case matter. Optimizing the most common use case/context-specific applications with some different learning/context-specific algorithms can help tremendously in improving this stuff too.

A few use cases done really well would make me happier than my current experience.

My two cents.


If you got rid of all the business BS and employed some common sense, you could do without this article. Speech recognition involves finding statistical trends. The more good data you have, the more confidence you have and the clearer a picture you have of the trends. Language as socially determined is inherently statistical. The “mystery” isn’t why more data is better. The burden is in the technical implementation and finding out which trends are noteworthy or what they tell us. The term “big data” sickens me.

Apparently this author has nothing better to do than write articles of no value that take an obvious banality, make a big deal out of it, and then fail even to explain what he promised to. I am reminded of ‘Requiem for a Dream’: We got a winner! OoooOOOOOOOOH! Be excited! Be, be excited!


Your links need attention; they are not working as they should


not impressed with google’s voice search. it doesn’t work anything like the tv commercial or the recent blog posts. google fan boys can rejoice in the better than average speech to text, but the results aren’t so hot.


I disagree. I have been using Google voice search for years and at first there was a learning curve but Google voice has honed in and now I get very quick and accurate searches and direction. I literally speak and then it goes directly where I was speaking about on web or maps. It takes time for voice to fine tune the way you speak.


“everyone” isn’t “excited about big data” .. collecting user data without permission is a part of google’s arrogant ignorance. what google’s voice search has is speed. however, in my testing today, when i asked for directions to Costco, google answered “here’s a map to Boston”. about 1/2 of the queries i posed were interpreted wrongly.

Jack N Fran Farrell

It also explains why MS and Apple services (finding what you say you want) have some catching up to do.
