Microsoft’s new AI tool can mimic a human voice with 3 seconds of audio

Microsoft’s new voice AI tool Vall-E can mimic anyone’s voice by analyzing a three-second audio sample. It also tries to preserve the speaker’s emotional tone. Vall-E can be used for text-to-speech applications and speech editing.

Vall-E builds on EnCodec, a neural audio codec developed by Meta. It analyzes a person’s voice and uses that information to model how the voice would sound speaking different phrases. Even with a three-second sample clip, the tool can replicate the speaker’s timbre and tone. With a larger sample of recordings, it could presumably produce even more realistic audio.

The tool was trained on 60,000 hours of English speech data from more than 7,000 speakers. Microsoft has also published dozens of audio examples of the AI model in action.

“VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity,” according to the official VALL-E website.

VALL-E has several potential use cases in media production. For example, it could make the work of voice-over artists considerably easier.

But it can also be misused. It could be used to make fake call recordings of politicians and celebrities, and it poses a security threat to applications that rely on voice passwords.
