The Voice of AI

In 2016, DeepMind, the artificial intelligence (AI) research lab owned by Google’s parent company Alphabet, developed a synthetic speech system called WaveNet. The system runs on an artificial neural network that’s capable of generating speech samples at an ostensibly better quality than comparable technologies. The voice of AI is becoming more human-like, so to speak. WaveNet has since been improved to work well enough for Google Assistant across all platforms.

Now, WaveNet has gotten even better at sounding human.

In a still-to-be-peer-reviewed paper published by Google in January 2018, WaveNet is paired with a text-to-speech system called Tacotron 2. Effectively the second generation of Google’s synthetic speech AI, the new system combines the deep neural network of Tacotron 2 with WaveNet.

First, Tacotron 2 translates text into a spectrogram, a visual representation of audio frequencies over time. This is then fed into WaveNet, which reads the spectrogram and generates the corresponding audio waveform.
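To make the idea of a spectrogram concrete, here is a minimal sketch in Python. It is not Google’s code and not the representation Tacotron 2 itself predicts; it simply computes an ordinary magnitude spectrogram of a synthetic 1,000 Hz tone. The function name `spectrogram`, the frame length, the hop size, and the sample rate are all assumptions chosen for illustration.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via a short-time Fourier transform.

    Illustrative helper (hypothetical name and parameters): the signal is
    cut into overlapping windowed frames, and each frame is transformed
    into its frequency content.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Rows are time steps, columns are frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a pure 1,000 Hz tone, sampled at an assumed 8,000 Hz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(tone)

# The strongest frequency bin, averaged over time, converted back to hertz.
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 256
print(spec.shape, peak_hz)
```

Each row of `spec` is a moment in time and each column a frequency band, which is the "picture" of sound that a system like WaveNet would turn back into audio.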

According to the study, the “model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech.” Simply put, it sounds very much like a person speaking.
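A mean opinion score is just the average of listener ratings, typically given on a scale from 1 (bad) to 5 (excellent). The sketch below shows the arithmetic in Python; the ratings are invented for illustration and are not data from the study, and the function name `mean_opinion_score` is likewise an assumption.

```python
from statistics import mean

def mean_opinion_score(ratings):
    """Average a list of listener ratings on a 1-to-5 scale."""
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
    return mean(ratings)

# Hypothetical ratings from eight listeners (not from the paper).
ratings = [5, 4, 5, 4, 5, 5, 4, 4]
score = mean_opinion_score(ratings)
print(score)
```

With many listeners rating many samples, two systems whose averages differ by only a few hundredths of a point are, for practical purposes, hard to tell apart.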

In fact, Google put recordings of a human and its new AI side by side, and it’s difficult to tell which is the person and which is the machine.

Synthetic Speech System

To date, AI systems have gotten better at blurring the line between human and machine. There are now AIs capable of generating images of people who aren’t real but look it. Another AI can even make fake videos. Nor can one discount the fact that some AIs are getting better at storytelling and making art.

Mimicking human speech has always been a challenge for AI networks. Now, DeepMind’s WaveNet and Tacotron 2 seem to be changing that, and at quite an impressive rate. Not only does the AI pronounce words clearly, but it also seems able to handle difficult-to-pronounce words and names, and to put emphasis on the appropriate words based on punctuation.

This isn’t to say that the new AI system is perfect, however. Its current iteration has only been trained to use one voice, which Google recorded from a woman it hired. For the WaveNet and Tacotron 2 system to work with other voices, say a man’s or another woman’s, the system would have to be trained again.

Once the Tacotron 2 system is perfected, the technology could go beyond its immediate applications in Google Assistant and assume other roles. Perhaps it could even take over certain jobs, adding to the already long list of occupations that seem ripe for AI.