A study published last Monday, heralded by Microsoft as a historic achievement, details a new speech recognition technology that can transcribe conversational speech as well as humans do, or at least as well as professional human transcriptionists (who are better at it than most people).
The technology scored a word error rate (WER) of 5.9%, down from the 6.3% WER reported just last month. “[I]t’s the lowest ever recorded against the industry standard Switchboard speech recognition task,” Microsoft reports. That rate is the same as (or even lower than) the one achieved by professional human transcriptionists who transcribed the same conversations.
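For readers unfamiliar with the metric, WER is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only; it is not Microsoft's scoring code, and the function name and sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name), not the scoring tool
    used in the study.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up sentences: one deleted word and one substituted word
    # against a five-word reference.
    rate = word_error_rate("we have reached human parity",
                           "we reached human party")
    print(f"WER: {rate:.1%}")
```

By this definition, a WER of 5.9% means roughly six word errors for every hundred words in the reference transcript.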
“We’ve reached human parity,” says Xuedong Huang, Microsoft’s chief speech scientist. The new technology uses neural language models that allow for more efficient generalization by grouping similar words together.
The achievement comes decades after speech pattern recognition was first studied in the 1970s. With Google’s DeepMind making waves in speech and image recognition (and in making machines speak more like humans do), the technology is Microsoft’s timely contribution to fast-paced artificial intelligence (AI) research and development.
The achievement was unlocked using the Computational Network Toolkit, Microsoft’s homegrown system for deep learning.
The new technology is bound to improve the user experience of Cortana, Microsoft’s personal voice assistant for Windows and Xbox One. “This will make Cortana more powerful, making a truly intelligent assistant possible,” says an excited Harry Shum, the executive vice president heading the Microsoft Artificial Intelligence and Research group. It should also lead to better speech-to-text transcription software.
Microsoft clarifies, however, that parity does not mean perfection. The computer did not recognize every word correctly, something not even humans can do (nor can Siri or other existing voice assistants).
Impressive as it is, there remains room for improvement. The next goal: making computers understand human conversation. “The next frontier is to move from recognition to understanding,” says Geoffrey Zweig, Speech & Dialog research group manager.