Baidu claims deep learning breakthrough with Deep Speech


Credit: Baidu

Chinese search engine giant Baidu says it has developed a speech recognition system, called Deep Speech, the likes of which have never been seen, especially in noisy environments. In restaurant settings and other loud places where other commercial speech recognition systems fail, the deep learning model proved accurate nearly 81 percent of the time.

That might not sound too great, but consider the alternative: commercial speech-recognition APIs against which Deep Speech was tested, including those for [company]Microsoft[/company] Bing, [company]Google[/company] and Wit.AI, topped out at nearly 65 percent accuracy in noisy environments. Those results probably underestimate the difference in accuracy, said [company]Baidu[/company] Chief Scientist Andrew Ng, who worked on Deep Speech along with colleagues at the company’s artificial intelligence lab in Palo Alto, California, because his team could only compare accuracy where the other systems all returned results rather than empty strings.
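The article doesn't spell out how that comparison was scored, but the idea of comparing only on utterances where every system returned a non-empty result can be sketched roughly as follows (function and system names are hypothetical, and exact-match accuracy stands in for whatever error metric the team actually used):

```python
def comparable_accuracy(refs, hyps_by_system):
    """Score each system only on utterances where *every* system
    returned a non-empty hypothesis, so empty-string failures by
    one system don't skew the head-to-head comparison."""
    kept = [i for i in range(len(refs))
            if all(hyps[i].strip() for hyps in hyps_by_system.values())]
    return {name: sum(hyps[i] == refs[i] for i in kept) / len(kept)
            for name, hyps in hyps_by_system.items()}
```

Because failed utterances are dropped for everyone, a system that returns empty strings on hard audio is never penalized for them, which is why Ng says the reported gap likely understates the real difference.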


Ng said that while the research is still just research for now, Baidu is definitely considering integrating it into its speech-recognition software for smartphones and connected devices such as Baidu Eye. The company is also working on an Amazon Echo-like home appliance called CoolBox, and even a smart bike.

“Some of the applications we already know about would be much more awesome if speech worked in noisy environments,” Ng said.

Deep Speech also outperformed top academic speech-recognition models by about 9 percent on a popular dataset called Hub5’00. The system is based on a type of recurrent neural network, an architecture often used for speech recognition and text analysis. Ng credits much of the success to Baidu’s massive GPU-based deep learning infrastructure, as well as to the novel way the team built up a 100,000-hour training set of speech data that exposes the system to noisy conditions.

Baidu gathered about 7,000 hours of data on people speaking conversationally, and then synthesized a total of roughly 100,000 hours by fusing those files with files containing background noise. That was noise from a restaurant, a television, a cafeteria, and the inside of a car and a train. By contrast, the Hub5’00 dataset includes a total of 2,300 hours.
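Baidu hasn't published the details of its fusing process, but the basic technique of mixing clean speech with background noise at a target signal-to-noise ratio can be sketched in a few lines (the function name and the choice of SNR as the control knob are assumptions, not Baidu's actual recipe):

```python
import numpy as np

def mix_noise(clean, noise, snr_db):
    """Overlay a background-noise recording onto clean speech,
    scaled so the result has the requested signal-to-noise ratio.

    clean, noise: 1-D float sample arrays; snr_db: target SNR in dB.
    """
    # Tile or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]

    # Scale the noise to hit the target SNR.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

Running one clean utterance against many noise files (restaurant, TV, car, and so on) at several SNRs is how roughly 7,000 hours of recordings can be stretched into on the order of 100,000 hours of noisy training data.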

“This is a vast amount of data,” said Ng. “… Most systems wouldn’t know what to do with that much speech data.”

Another big improvement, he said, came from using an end-to-end deep learning model on that huge dataset rather than a standard, and computationally expensive, type of acoustic model. Traditional approaches break recognition down into multiple steps, including one called speaker adaptation, Ng explained, but “we just feed our algorithm a lot of data” and rely on it to learn everything it needs to. Accuracy aside, the Baidu approach also resulted in a dramatically smaller code base, he added.
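The article doesn't describe Deep Speech's architecture in detail. As a rough illustration of what "end-to-end" means here, a minimal recurrent network can map spectrogram frames directly to per-frame character probabilities, with no separate phoneme or speaker-adaptation stage in between; all the dimensions below are hypothetical, and real systems use far larger, trained networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 81 spectrogram bins per frame, 128 hidden units,
# 29 output symbols (26 letters, space, apostrophe, and a blank).
N_FEAT, N_HID, N_OUT = 81, 128, 29

# Randomly initialized weights stand in for a trained model.
Wx = rng.normal(0, 0.1, (N_HID, N_FEAT))
Wh = rng.normal(0, 0.1, (N_HID, N_HID))
Wy = rng.normal(0, 0.1, (N_OUT, N_HID))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(frames):
    """Map a (T, N_FEAT) spectrogram straight to (T, N_OUT) character
    probabilities: audio in, text distribution out, one model."""
    h = np.zeros(N_HID)
    out = []
    for x in frames:
        h = np.tanh(Wx @ x + Wh @ h)   # simple recurrent update
        out.append(softmax(Wy @ h))    # per-frame character distribution
    return np.array(out)
```

Everything a traditional pipeline handles in separate stages has to be absorbed by this single network during training, which is why the approach demands so much data and also why it yields a much smaller code base.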

You can hear Ng talk more about Baidu’s work in deep learning in this Gigaom Future of AI talk embedded below. That event also included a talk from Google speech recognition engineer Johan Schalkwyk. Deep learning will also play a prominent role at our upcoming Structure Data conference, where speakers from [company]Facebook[/company], [company]Yahoo[/company] and elsewhere will discuss how they do it and how it impacts their businesses.


Jeremy B Walls

Fusing distorted and normal speech is a great idea to extract normal speech. What if there was an algorithm that only recognized pronunciation, and all the words were only translated in the end for better results, or if it were to guess ‘live’ words would it not go by ‘Theme’?!?!


I trust the video sound quality is not an indication of the product – it sounds as if the speaker is in a canyon.

Deborah Dahl

There’s no question that improvements in basic speech recognition accuracy are valuable and welcome, but if the goal is natural interaction with machines, we need to get a lot further along in the state of the art of natural language understanding and dialog management.

Andrew Morris

From this article alone it is not clear to what extent the better performance of “Deep Speech” is due to the end-to-end deep learning or to the extra-large speech training database. From what I have heard, I expect that would show only a small advantage for deep learning over other methods. Also, performance results comparing systems are not very meaningful unless each system is trained and tested on exactly the same data. Then there is also a problem with very large speech databases in that you usually need a separate one for each language, one for adults and one for young children, as well as for each noise level and noise type.


Well I hope they rerecorded those “fused” files from a speaker to a microphone because the digital representation of a clean vocal file would otherwise still be present for the system to analyze.
