Giving Machines a Voice

Last year, Google gave a machine the ability to generate human-like speech through its voice synthesis program, WaveNet. Built on a deep neural network from DeepMind, Google’s artificial intelligence (AI) lab, WaveNet produces synthetic speech from a given text. Now, Chinese internet search company Baidu has developed what may be the most advanced speech synthesis program yet, called Deep Voice.

Developed in Baidu’s AI research lab based in Silicon Valley, Deep Voice represents a major breakthrough in speech synthesis technology by largely doing away with the behind-the-scenes fine-tuning typically necessary for such programs. As a result, Deep Voice can learn to talk in a matter of hours, with virtually no help from humans.

Deep Voice uses a relatively simple method. Through deep-learning techniques, it breaks text down into phonemes, the smallest perceptually distinct units of sound. A speech synthesis network then reproduces these sounds. The need for fine-tuning is greatly reduced because every stage of the process relies on deep learning: all the researchers needed to do was train the algorithm.
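To make the first step concrete, here is a minimal Python sketch of turning text into phonemes. It is only an illustration: Deep Voice learns this mapping with a neural network, whereas the sketch swaps in a tiny hand-written lookup table. The names `TOY_LEXICON` and `text_to_phonemes` are hypothetical, and the phoneme symbols follow the ARPAbet convention.

```python
# Toy pronunciation table (hypothetical, ARPAbet-style symbols).
# Assumption: a lookup table stands in for the learned neural model
# that Deep Voice uses for this step.
TOY_LEXICON = {
    "deep": ["D", "IY", "P"],
    "voice": ["V", "OY", "S"],
    "can": ["K", "AE", "N"],
    "talk": ["T", "AO", "K"],
}


def text_to_phonemes(text):
    """Split text into words and map each known word to its phonemes."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?;:")  # drop punctuation attached to the word
        if word not in TOY_LEXICON:
            raise KeyError(f"no pronunciation for {word!r} in the toy lexicon")
        phonemes.extend(TOY_LEXICON[word])
    return phonemes


print(text_to_phonemes("Deep Voice can talk."))
```

In a real system, the resulting phoneme sequence would then be handed to the audio synthesis network, which is far beyond the scope of a short example.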

“For the audio synthesis model, we implement a variant of WaveNet that requires fewer parameters and trains faster than the original,” the Baidu researchers wrote in a study published online. “By using a neural network for each component, our system is simpler and more flexible than traditional text-to-speech systems, where each component requires laborious feature engineering and extensive domain expertise.”

Towards Real-Time Conversion

Text-to-speech systems aren’t entirely new. They’re present in many of the world’s modern gadgets and devices, from simpler ones, like talking clocks and phone answering systems, to more complex versions, like those in navigation apps. These, however, have been built using large databases of speech recordings. As such, the speech generated by these traditional text-to-speech systems doesn’t flow as seamlessly as actual human speech.

Baidu’s work on Deep Voice is a step towards achieving human-like speech synthesis in real time, without using pre-recorded responses. Deep Voice puts phonemes together in such a way that they sound like actual human speech. “We optimize inference to faster-than-real-time speeds, showing that these techniques can be applied to generate audio in real-time in a streaming fashion,” the researchers said.

However, there are still certain variables that the new system cannot yet control: the stresses on phonemes, and the duration and natural frequency of each sound. Once perfected, control of these variables would allow Baidu to change the voice of the speaker and, possibly, the emotions conveyed by a word.

Such control would, at the very least, be computationally demanding, limiting how much Deep Voice can be used for real-time speech synthesis in the real world. As the Baidu researchers explained:

“To perform inference at real-time, we must take great care to never recompute any results, store the entire model in the processor cache (as opposed to main memory), and optimally utilize the available computational units.” 
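The idea of never recomputing results can be illustrated with a small Python sketch. It is a toy under stated assumptions, not Baidu’s implementation: each layer is a two-tap causal filter that keeps its recent inputs in a short queue, so producing a new sample needs only one lookup per layer instead of reprocessing the whole history. The class `CachedDilatedLayer`, the function `stream`, and the weights are all hypothetical, and the feedback loop of a real audio model (where each output sample becomes the next input) is left out for brevity.

```python
from collections import deque


class CachedDilatedLayer:
    """One causal two-tap layer: y[t] = w_past * x[t - dilation] + w_now * x[t]."""

    def __init__(self, dilation, w_past, w_now):
        self.w_past = w_past
        self.w_now = w_now
        # Queue holding the last `dilation` inputs; zeros stand in for
        # the silence before the first sample.
        self.cache = deque([0.0] * dilation, maxlen=dilation)

    def step(self, x_now):
        x_past = self.cache[0]    # input seen `dilation` steps ago
        self.cache.append(x_now)  # the oldest entry drops out automatically
        return self.w_past * x_past + self.w_now * x_now


def stream(layers, inputs):
    """Yield one output per input, passing each sample through every layer once."""
    for x in inputs:
        for layer in layers:
            x = layer.step(x)
        yield x


# Hypothetical two-layer stack with dilations 1 and 2.
layers = [CachedDilatedLayer(1, 0.5, 0.5), CachedDilatedLayer(2, 0.5, 0.5)]
outputs = list(stream(layers, [1.0, 0.0, 0.0, 0.0]))
print(outputs)
```

Because each layer stores only as many past values as its dilation, the memory footprint stays small and fixed, which is in the spirit of keeping the working set close to the processor.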

In the future, better speech synthesis systems could be used to improve the assistant features found in smartphones and smart home devices. At the very least, they would make talking to your devices feel more real.