PASIEKA/Getty Images/Science Photo Library
Artificial Intelligence

At Last, Google’s DeepMind AI Can Make Machines Sound Like Humans

It's even able to change voices like an auditory chameleon.

Jelor Gallego, September 9th 2016

Generated Speech

If you’ve ever been lost in the maze of YouTube videos, you may have stumbled on clips of computers reading news articles. You’d recognize the staccato, robotic nature of the voice. We’ve come a long way from “Danger, Will Robinson!” but there is yet to be a computer that can seamlessly mimic a human voice.

Now there’s a new contender, brought to you by the brilliant minds behind DeepMind. Google has announced WaveNet, a new voice synthesis program powered by a deep neural network.

Understanding speech has powered programs like Google Voice Search for quite some time now. Synthesizing speech, however, is proving to be quite a challenge. The most prominent method right now is concatenative TTS (text-to-speech), which stitches together fragments of recorded speech. Its major drawback is that it can’t modify the fragments to create something new, which results in the stilted “robotic” voice. Another method is parametric TTS, which generates speech from a model's parameters and passes the result through a vocoder, producing even less natural speech.
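
To make the limitation of the concatenative approach concrete, here is a minimal sketch in Python. It is not how any production TTS engine is built: the `FRAGMENTS` table, its made-up amplitude values, and the `concatenative_tts` function are all hypothetical and exist only to show that this style of synthesis can reuse recordings but cannot reshape them.

```python
# Hypothetical inventory of pre-recorded fragments. Each word maps to a short
# list of made-up amplitude values standing in for a real recording.
FRAGMENTS = {
    "hello": [0.1, 0.3, 0.2],
    "world": [0.4, 0.1],
}


def concatenative_tts(text):
    """Join stored fragments in order; the fragments are never altered."""
    audio = []
    for word in text.lower().split():
        if word not in FRAGMENTS:
            # Nothing new can be created: a missing recording is a dead end.
            raise KeyError(f"no recording available for {word!r}")
        audio.extend(FRAGMENTS[word])
    return audio


speech = concatenative_tts("Hello world")
```

Because the fragments are copied verbatim, changing the speaker, the emphasis, or the emotion would mean recording a whole new inventory.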

Learning to Speak

Google’s WaveNet uses a completely different approach. Instead of simply analyzing the audio it's fed, it learns from it, similar to how many deep neural systems work. Working directly on the raw waveform, which contains at least 16,000 samples for every second of audio, WaveNet generates its own audio one sample at a time.

Image credit: DeepMind

And it can do this without much human intervention; it uses statistics to predict which audio piece it needs, that is, what it has to “say” next.
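
The idea of predicting the next piece of audio from what came before can be sketched in a few lines of Python. This is a toy illustration, not WaveNet itself: `next_sample_distribution` is a hypothetical stand-in for a trained neural network, and the 8 amplitude levels and the weighting rule are assumptions chosen only to keep the example small. What it does show is the loop: ask for a probability for every possible next sample, draw one, append it, and repeat.

```python
import random

SAMPLE_RATE = 16_000  # samples per second, the figure quoted in the article


def next_sample_distribution(history, levels=8):
    """Toy stand-in for a trained model (an assumption, not WaveNet's network).

    Returns one probability per amplitude level, favouring levels close to
    the most recent sample so the toy waveform changes smoothly.
    """
    previous = history[-1] if history else levels // 2
    weights = [1.0 / (1 + abs(level - previous)) for level in range(levels)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, levels=8, seed=0):
    """Build audio one sample at a time, each drawn from the predicted distribution."""
    rng = random.Random(seed)
    audio = []
    for _ in range(num_samples):
        probs = next_sample_distribution(audio, levels)
        sample = rng.choices(range(levels), weights=probs, k=1)[0]
        audio.append(sample)  # the new sample becomes context for the next one
    return audio


# Ten milliseconds of toy "audio" already takes 160 separate predictions.
clip = generate(SAMPLE_RATE // 100)
```

At 16,000 samples per second, a single second of speech requires 16,000 passes through this loop, which gives a sense of why generating raw audio this way is demanding.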

Want to take a listen for yourself? The announcement post has several voice samples in both English and Mandarin Chinese. The system is also able to synthesize its own music, since it can analyse any sound pattern and not just speech, and you can listen to samples of those original compositions too. Perhaps most impressively, the system is able to synthesize speech without any text input. Where TTS always requires input as instruction, WaveNet is able to create speech sounds without a road map. Granted, the result is just a string of nonsense sounds, but it also contains the sounds of mouth movements and breathing. This hints at the system's exciting potential to create far more realistic computer voices.
