Microsoft's Speech Recognition Is Now as Good as a Human Transcriber

Incredible Accuracy

Microsoft recently announced that its conversational speech recognition system has achieved a 5.1 percent word error rate, its best performance to date. This beats the 5.9 percent error rate achieved in October 2016 and puts its accuracy at the same level as professional human transcribers, who can listen to a recording multiple times, draw on cultural context, and collaborate with other transcribers.

After the 2016 study, other researchers put the human error rate on the same task at 5.1 percent. The new result therefore reaches human parity even by this more conservative standard.
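For readers unfamiliar with the metric, word error rate is conventionally computed by aligning the system's transcript against a reference transcript and dividing the number of substituted, deleted, and inserted words by the number of words in the reference. The sketch below illustrates that standard definition only; the function name and the example sentences are made up for illustration, and it is not the scoring pipeline used in the Microsoft study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Illustrative helper (hypothetical name), using a word-level edit distance.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis (deletion)
                curr[j - 1] + 1,     # extra word in hypothesis (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("should" -> "could") and one
# deletion ("on"), so 2 errors over 7 reference words.
wer = word_error_rate(
    "i think we should meet on friday",
    "i think we could meet friday",
)
print(f"{wer:.1%}")
```

By this measure, a 5.1 percent word error rate means that roughly one word in twenty differs from the reference transcript.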

The recordings that formed the basis of both studies came from Switchboard, a research collection of thousands of telephone conversations that has been used to test speech recognition systems since the early 1990s. The most recent study was conducted by a team from Microsoft AI and Research with the aim of improving accuracy and achieving human parity, despite human advantages such as the ability to cooperate and to draw on context and experience.

The Whole Picture

Researchers in this study reduced the error rate by around 12 percent compared with their 2016 result, primarily by improving the language models and neural net-based acoustic models of Microsoft’s speech recognition system. Significantly, they also enabled the system’s speech recognizer to make use of entire conversations instead of just snippets, which allowed it to better predict which words or phrases were most likely to come next.

This also allowed the system to adapt its transcriptions to context more successfully, much as humans do naturally in conversation. In other words, the researchers taught the system to take in the whole picture when working out what it was hearing.

Microsoft’s speech recognition system is currently used in Cortana, Microsoft Cognitive Services, and Presentation Translator. In the future, human-like speech recognition software will be essential to creating AI that humans can interact and work with as easily as they would with a human collaborator.

A 5.1 percent word error rate is an important accomplishment, but many challenges remain for the speech research community. According to Microsoft technical fellow Xuedong Huang, several goals are still some way off: achieving human levels of recognition in noisy environments with distant microphones, improving recognition of accented speech, and recognizing languages and speaking styles for which only limited training data is available.

Moreover, taking this technology beyond transcription and into deeper comprehension, such as understanding intent and meaning, is the next major frontier for speech technology and artificial intelligence.