Google's New AI Can Mimic Human Speech Almost Perfectly

*Image: StationaryTraveller/Getty Images*

Elocution Lessons

Last year, artificial intelligence (AI) research company DeepMind shared details on WaveNet, a deep neural network used to synthesize realistic human speech. Now, an improved version of the technology is being rolled out for use with Google Assistant.

A speech synthesis system, also known as text-to-speech (TTS), typically uses one of two techniques.

Concatenative TTS pieces together chunks of recordings from a voice actor. The drawback of this method is that the audio library must be replaced whenever the voice is upgraded or changed.
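As a rough illustration of the concatenative approach, the Python sketch below joins stored chunks in order. The `UNIT_LIBRARY` contents, the unit names, and the `concatenative_tts` function are hypothetical stand-ins invented for this example; a real system stores recorded audio and chooses units far more carefully.

```python
# Hypothetical unit library: each entry stands in for a recorded audio chunk.
# Short lists of integers are used here in place of real recorded waveforms.
UNIT_LIBRARY = {
    "hel": [3, 5, 4],
    "lo": [2, 1],
    "wor": [6, 7, 6],
    "ld": [1, 0],
}


def concatenative_tts(units, library=UNIT_LIBRARY):
    """Join pre-recorded chunks in order; fail if a chunk was never recorded."""
    waveform = []
    for unit in units:
        if unit not in library:
            # Anything missing from the library cannot be spoken at all.
            raise KeyError(f"no recording for unit {unit!r}")
        waveform.extend(library[unit])
    return waveform


print(concatenative_tts(["hel", "lo"]))  # [3, 5, 4, 2, 1]
```

The `KeyError` branch mirrors the drawback described above: the output is limited to whatever the library already contains.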

The other technique, parametric TTS, uses a set of parameters to produce computer-generated speech, but this speech can sometimes sound unnatural and robotic.
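To make the idea of generating sound from parameters concrete, here is a deliberately minimal Python sketch. The function name `parametric_tone` and its parameters (pitch, duration, amplitude, sample rate) are assumptions chosen for illustration; real parametric TTS uses a far richer model of the voice than a single sine wave.

```python
import math


def parametric_tone(pitch_hz, duration_s, amplitude=0.5, sample_rate=16000):
    """Compute a waveform purely from parameters, with no recordings involved."""
    n_samples = round(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * pitch_hz * t / sample_rate)
        for t in range(n_samples)
    ]


samples = parametric_tone(pitch_hz=120.0, duration_s=0.01)
print(len(samples))  # 160
```

Changing the voice here only means changing the numbers passed in, but the result is as synthetic as the formula that produced it.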

WaveNet, by contrast, generates waveforms from scratch using a convolutional neural network.

To begin, the platform was trained on a large number of speech samples, learning which waveforms sounded realistic and which did not. This gave the speech synthesizer the ability to produce natural intonation, including details like lip smacks. The system also developed a unique “accent” depending on the samples it was given, which means it could create any number of distinct voices when trained on different data sets.
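Generating a waveform "from scratch" can be pictured as a loop that computes each new audio sample from the samples that came before it. The Python sketch below rests on assumptions that go beyond this article: the sample-by-sample loop and the spaced-out ("dilated") look-back follow DeepMind's published description of WaveNet, while the function names, the fixed weights, and the single linear filter are hypothetical simplifications. The real network stacks many trained layers.

```python
def predict_next(history, weights, dilation):
    """Weighted sum of past samples spaced `dilation` steps apart."""
    total = 0.0
    for k, w in enumerate(weights):
        idx = len(history) - 1 - k * dilation
        if idx >= 0:  # positions before the start count as silence
            total += w * history[idx]
    return total


def generate(seed, weights, dilation, n_new):
    """Autoregressive loop: every new sample depends only on earlier ones."""
    samples = list(seed)
    for _ in range(n_new):
        samples.append(predict_next(samples, weights, dilation))
    return samples


# Hypothetical, untrained weights chosen only to keep the arithmetic simple.
print(generate(seed=[1.0, 0.0], weights=[0.5, 0.25], dilation=2, n_new=3))
# [1.0, 0.0, 0.0, 0.25, 0.125]
```

Because each sample has to wait for the ones before it, a loop like this is naturally slow when written naively.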

Sharp Tongue

WaveNet’s biggest limitation was that it initially required a significant amount of computing power and was slow, needing one second to generate 0.02 seconds of audio.

After 12 months of improvements, DeepMind’s engineers have optimized WaveNet to the point that it can now produce one second of raw waveform in just 50 milliseconds, 1,000 times faster than the original. What’s more, the resolution of each sample has been increased from 8 bits to 16 bits, contributing to its higher scores in tests with human listeners.
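The figures quoted above can be checked with a few lines of Python. The variable names are made up for this example; the inputs are the numbers reported in the article, expressed in whole milliseconds to avoid rounding issues.

```python
# Original system: 1 second of compute produced 0.02 seconds (20 ms) of audio.
ORIGINAL_COMPUTE_MS = 1000
ORIGINAL_AUDIO_MS = 20

# Improved system: 50 ms of compute produces 1 second (1000 ms) of audio.
IMPROVED_COMPUTE_MS = 50
IMPROVED_AUDIO_MS = 1000

# Compute time needed per one second of audio, in milliseconds.
original_ms_per_audio_s = ORIGINAL_COMPUTE_MS * 1000 // ORIGINAL_AUDIO_MS
improved_ms_per_audio_s = IMPROVED_COMPUTE_MS * 1000 // IMPROVED_AUDIO_MS

speedup = original_ms_per_audio_s // improved_ms_per_audio_s
realtime_factor = IMPROVED_AUDIO_MS // IMPROVED_COMPUTE_MS

# Number of distinct values a single sample can take at each bit depth.
levels_8_bit = 2 ** 8
levels_16_bit = 2 ** 16

print(original_ms_per_audio_s)      # 50000
print(improved_ms_per_audio_s)      # 50
print(speedup)                      # 1000
print(realtime_factor)              # 20
print(levels_8_bit, levels_16_bit)  # 256 65536
```

In other words, the original needed 50 seconds of computation per second of speech, while the improved version runs 20 times faster than real time, and each sample can take 65,536 possible values instead of 256.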

These improvements mean the system can now be integrated into consumer products, like Google Assistant.

WaveNet is now being used to generate the U.S. English and Japanese voices for Google Assistant across all platforms. Because the system can create specialized voices based on whatever samples are fed into it, Google should be able to use WaveNet to synthesize realistic-sounding human speech for other languages and dialects moving forward.

Voice interfaces are becoming more prevalent across all forms of computing, but the stilted nature of some synthetic speech has put off many potential users. DeepMind’s efforts to improve this technology could prompt more widespread adoption and should refine the existing experience.