Meta’s Voicebox AI increases potential benefits and dangers of artificial audio

[A new and improved tool from Meta called Voicebox AI that generates and modifies human-sounding voices could help those with vocal cord damage speak, produce natural-sounding translations across many languages, make non-player characters (NPCs) in games and digital assistants more convincing, and allow us to listen to written messages in the voice of the sender. Unfortunately, it also has significant potential for abuse (e.g., in deepfake scams). The story below from Engadget provides some of the details, and for more information see the Meta AI blog (and go to the bottom for links to the research paper and additional demos). For more on the concerns about Voicebox AI see coverage from Decrypt and ExtremeTech. –Matthew]

Meta’s Voicebox AI is a Dall-E for text-to-speech

But the company won’t be sharing the app or its source code for the time being.

By Andrew Tarantola
June 16, 2023

Today, we are one step closer to the immortal celebrity future we have long been promised (since April). Meta has unveiled Voicebox, its generative text-to-speech model that promises to do for the spoken word what ChatGPT and Dall-E, respectively, did for text and image generation.

Essentially, it's a text-to-output generator just like GPT or Dall-E — just instead of creating prose or pretty pictures, it spits out audio clips. Meta defines the system as “a non-autoregressive flow-matching model trained to infill speech, given audio context and text.” It’s been trained on more than 50,000 hours of unfiltered audio. Specifically, Meta used recorded speech and transcripts from a collection of public domain audiobooks written in English, French, Spanish, German, Polish, and Portuguese.

That diverse data set allows the system to generate more conversational-sounding speech, regardless of the languages spoken by each party, according to the researchers. “Our results show that speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech.” What’s more, the computer-generated speech performed with just a 1 percent error rate degradation, compared with the 45 to 70 percent drop-off seen with existing TTS models.

The system was first taught to predict speech segments based on the segments around them as well as the passage’s transcript. “Having learned to infill speech from context, the model can then apply this across speech generation tasks, including generating portions in the middle of an audio recording without having to recreate the entire input,” the Meta researchers explained.
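The infill idea described above — masking a span of audio and reconstructing it from the surrounding frames plus the transcript — can be illustrated with a toy sketch. This is not Meta's model: here the learned flow-matching network is replaced by simple linear interpolation between the context frames on either side of the gap, purely to show the masking-and-regeneration pattern.

```python
import numpy as np

def infill(frames: np.ndarray, start: int, end: int) -> np.ndarray:
    """Fill frames[start:end] using the frames just outside the span.

    A stand-in for a learned infill model: the gap is reconstructed by
    linearly interpolating between the last frame before the mask and
    the first frame after it.
    """
    out = frames.copy()
    left, right = frames[start - 1], frames[end]  # context anchors
    n = end - start
    for i in range(n):
        t = (i + 1) / (n + 1)  # relative position within the gap, in (0, 1)
        out[start + i] = (1 - t) * left + t * right
    return out

# 10 frames of 2-dim "speech features"; corrupt a span, then regenerate it.
frames = np.linspace(0.0, 1.0, 10)[:, None].repeat(2, axis=1)
corrupted = frames.copy()
corrupted[4:7] = 0.0  # e.g. a dog bark obliterated these frames
repaired = infill(corrupted, 4, 7)
print(np.allclose(repaired, frames))  # the linear toy signal is recovered
```

In the real system the regeneration is conditioned on the transcript as well as the audio context, so it can synthesize actual speech content rather than a smooth interpolation.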

Voicebox is also reportedly capable of actively editing audio clips, eliminating noise from the speech and even replacing misspoken words. “A person could identify which raw segment of the speech is corrupted by noise (like a dog barking), crop it, and instruct the model to regenerate that segment,” the researchers said, much like using image-editing software to clean up photographs.

Text-to-speech generators have been around for a minute — they’re how your parents’ TomToms were able to give dodgy driving directions in Morgan Freeman’s voice. Modern iterations like Speechify or ElevenLabs’ Prime Voice AI are far more capable, but they still largely require mountains of source material in order to properly mimic their subject — and then another mountain of different data for every. single. other. subject you want them trained on.

Voicebox doesn’t, thanks to a novel zero-shot text-to-speech training method Meta calls Flow Matching. The benchmark results aren’t even close: Meta’s AI reportedly outperformed the current state of the art both in intelligibility (a 1.9 percent word error rate vs. 5.9 percent) and “audio similarity” (a composite score of 0.681 vs. the state of the art’s 0.580), all while operating as much as 20 times faster than today’s best TTS systems.
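The word error rate cited in those benchmarks is a standard metric: the word-level edit distance (insertions, deletions, substitutions) between a system's transcript and the reference, divided by the number of reference words. A minimal implementation, not Meta's evaluation code, looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution out of six reference words → 1/6 ≈ 0.167
print(word_error_rate("the cat sat on the mat", "the cat sat on the hat"))
```

A 1.9 percent WER means fewer than two such word-level mistakes per hundred reference words.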

But don’t get your celebrity navigators lined up just yet: neither the Voicebox app nor its source code is being released to the public at this time, Meta confirmed on Friday, citing “the potential risks of misuse” despite the “many exciting use cases for generative speech models.” Instead, the company released a series of audio examples (see [the Meta AI blog]) as well as the program’s initial research paper. In the future, the research team hopes the technology will find its way into prosthetics for patients with vocal cord damage, in-game NPCs, and digital assistants.



ISPR Presence News
