Engineers at Facebook’s AI research lab created a machine learning system that can clone not only a person’s voice but also their cadence, an uncanny ability they showed off by duplicating the voices of Bill Gates and other notable figures.
This system, dubbed MelNet, could lead to more realistic-sounding AI voice assistants or voice models, the kind used by people with speech impairments, but it could also make it even harder to distinguish between real speech and audio deepfakes.
Text-to-speech systems aren’t particularly new, but in a paper published on the preprint server arXiv, the Facebook researchers describe how MelNet differs from its predecessors.
While many previous systems were trained on audio waveforms, which plot how a sound’s amplitude changes over time, the Facebook team trained its system on spectrograms, a representation the researchers describe as far more compact and informationally dense.
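To make the distinction concrete: a waveform is a flat list of amplitude samples, while a spectrogram breaks the signal into short overlapping frames and records how much energy sits at each frequency in each frame. The sketch below is not Facebook's actual pipeline (the paper uses mel-scaled spectrograms with different parameters); it is a minimal, pure-Python short-time Fourier transform with assumed frame and hop sizes, just to show the conversion from one representation to the other.

```python
import math
import cmath

def stft_magnitude(signal, frame_size=64, hop=32):
    """Convert a waveform (flat amplitude-vs-time samples) into a
    spectrogram: a time-frequency grid where each row is the magnitude
    spectrum of one short, windowed frame of the signal.

    frame_size and hop are illustrative values, not MelNet's settings.
    """
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        # Hann window tapers the frame edges to reduce spectral leakage.
        windowed = [
            s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        # Naive DFT; keep only the non-negative frequency bins.
        spectrum = [
            abs(sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n in range(frame_size)))
            for k in range(frame_size // 2 + 1)
        ]
        frames.append(spectrum)
    return frames  # shape: (num_frames, frame_size // 2 + 1)

# A test tone: a pure sine that completes 8 cycles per 64-sample frame,
# so its energy should concentrate in frequency bin 8 of each frame.
signal = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
spec = stft_magnitude(signal)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(peak_bin)  # prints 8
```

The compactness the researchers mention comes from this framing: one spectrogram row summarizes dozens of raw samples, so a generative model has far fewer, more informative steps to predict than it would sample-by-sample.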
The Facebook team trained the system on audio from TED Talks and shares clips of it mimicking eight speakers, including Gates, on a GitHub page.
The speech is still somewhat robotic, but the voices are recognizable, and if researchers can smooth out the output even slightly, it’s conceivable that MelNet could fool a casual listener into thinking they’re hearing a public figure say something they never actually uttered.
READ MORE: Facebook’s AI system can speak with Bill Gates’s voice [MIT Tech Review]