New Microsoft AI Can Clone Your Voice From Three Seconds of Audio


VALL-E Parking

Microsoft says its new text-to-speech AI can clone your voice, tone and all, from a three-second snippet of audio. It’s called VALL-E, and we have mixed feelings.

The tech behind the system, which Microsoft refers to in a new paper as a “neural codec language model,” is complex, but in practice, using the system appears to be wildly simple. Plug in an audio sample, then some text, and voilà: real-sounding speech.

Of course, many text-to-speech apps already exist. Many news sites, us included, offer machine-powered narration of their articles, while voice assistants like Siri and Alexa are hugely popular.

Most existing speech-generating programs, however, require a large amount of input audio. They also haven’t exactly figured out how to make AI voices sound particularly human, mostly because emotional tone and tiny inflections are incredibly complex to convey.

If Microsoft’s system really can deliver on the tone piece, with that little required on the input side? That’s a big deal.

Mixed Feelings

According to its creators, VALL-E has a number of applications, including “zero-shot TTS, speech editing, and content creation.” They add that OpenAI’s GPT-3 language model would be a particularly useful piece of tech to combine with the new speech generator as a means of churning out content. Microsoft, per its absolutely massive investment in OpenAI, has put a ton of resources into GPT-3 and is already working it into several products.

And if the latter is something you might be into, Microsoft does have a point. Theoretically, by combining VALL-E and GPT-3 — two powerful pieces of AI-driven tech — you could patch together a ton of real-sounding, believable content, incredibly quickly.

But that, of course, is where some ethically tricky hypotheticals enter the picture.

Fake and misleading sound bites are obviously a concern here. After all, if you only need three seconds of audio, you could theoretically use anything from a celebrity interview to a real person’s Instagram story to impersonate someone.

That said, Microsoft was careful to address that concern, explaining that it’s refraining — at least for now — from making the code open source due to “potential risks in misuse of the model.” They also claim that they’re working on incorporating some kind of system that detects whether audio was created using VALL-E, buuuuuut maybe they should ask their friends over at OpenAI how easy that really is.
