If you’ve ever been lost in the maze of YouTube videos, you may have stumbled on clips of computers reading news articles. You’d recognize the staccato, robotic quality of the voice. We’ve come a long way from “Danger, Will Robinson!”, but there has yet to be a computer that can seamlessly mimic a human voice.
Understanding voice samples has been powering programs like Google Voice Search for quite some time now. However, synthesizing new speech from those samples is proving to be quite a challenge. The most prominent method right now is concatenative TTS (text-to-speech), which stitches together fragments of recorded speech. Its major drawback is that it can’t modify the fragments to create something new, which results in the stilted “robotic” voice. Another method is parametric TTS, which passes speech through a vocoder and produces even less natural speech.
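To make the concatenative idea concrete, here is a minimal Python sketch. It is an illustration only, not how any production TTS engine is built: the unit names, the `UNIT_DB` table, and the sine tones standing in for recorded speech are all assumptions made for the example.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)

def fake_recording(freq_hz, duration_ms):
    """Stand-in for a recorded speech fragment: a plain sine tone."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

# Hypothetical unit database: one stored fragment per unit name.
UNIT_DB = {
    "HH": fake_recording(200.0, 50),
    "EH": fake_recording(300.0, 100),
    "L": fake_recording(250.0, 50),
    "OW": fake_recording(350.0, 100),
}

def concatenative_tts(units):
    """Join stored fragments end to end; the fragments are never altered."""
    audio = []
    for unit in units:
        audio.extend(UNIT_DB[unit])
    return audio

audio = concatenative_tts(["HH", "EH", "L", "OW"])
```

Because the function can only replay what is already in the table, any change in pitch, pace, or emphasis would need a new recording, which is the limitation described above.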
Google’s WaveNet uses a completely different approach. Instead of simply analyzing the audio it’s fed, it learns from it, much as many deep neural networks do. By working with at least 16,000 samples per second, WaveNet can generate its own raw audio samples.
And it can do this without much human intervention; it uses statistics to predict which piece of audio it needs, that is, what it has to “say” next.
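The idea of predicting the next piece of audio from what came before can be sketched in a few lines of Python. This is a toy under loud assumptions: a simple count table stands in for WaveNet's neural network, the 16 quantization levels and the two-sample context are arbitrary choices, and the sine wave used as training audio is made up for the example.

```python
import math
import random
from collections import Counter, defaultdict

LEVELS = 16  # number of quantization levels (assumed; kept small for the toy)

def quantize(x):
    """Map a value in [-1, 1] to an integer level in [0, LEVELS - 1]."""
    return min(LEVELS - 1, int((x + 1.0) / 2.0 * LEVELS))

def fit_next_sample_counts(samples, context=2):
    """Count how often each level follows each run of `context` levels."""
    counts = defaultdict(Counter)
    for i in range(context, len(samples)):
        counts[tuple(samples[i - context:i])][samples[i]] += 1
    return counts

def generate(counts, seed, n, rng):
    """Draw one sample at a time, each conditioned on the previous ones."""
    out = list(seed)
    context = len(seed)
    for _ in range(n):
        options = counts.get(tuple(out[-context:]))
        if not options:
            # Unseen context: fall back to repeating the last level.
            out.append(out[-1])
            continue
        levels = list(options.keys())
        weights = list(options.values())
        out.append(rng.choices(levels, weights=weights)[0])
    return out

# Made-up training audio: a quantized sine wave.
training = [quantize(math.sin(2 * math.pi * i / 40)) for i in range(2000)]
counts = fit_next_sample_counts(training)
generated = generate(counts, seed=training[:2], n=200, rng=random.Random(0))
```

Each new value is drawn from a probability distribution conditioned on the samples already produced, then fed back in as context for the next one. A real system would replace the count table with a learned model, but the sample-by-sample loop is the part this sketch is meant to show.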
Want to take a listen for yourself? The announcement post has several voice samples in both English and Mandarin Chinese. The system can also synthesize its own music, since it can analyze any sound patterns and not just speech, and you can listen to samples of its original compositions as well. Perhaps most impressively, the system is able to synthesize speech without input. Where TTS always requires input as instruction, WaveNet is able to create speech sounds without a road map. Granted, the result is just a string of nonsense sounds, but it also contains the sounds of mouth movements and breathing. This points to the system’s exciting potential to create the most realistic computer voices yet.