[Ars Technica has two stories about fast-evolving AI-based technologies that enable image and voice manipulation. The story below is about the latter – follow the links for the mentioned demonstrations and to create your own. The longer one is “With Stable Diffusion, You May Never Believe What You See Online Again. AI Image Synthesis Goes Open Source, with Big Implications” and includes these observations:
“Stable Diffusion and other models are already starting to take on dynamic video generation and manipulation, so expect photorealistic video generation via text prompts before too long. From there, it’s logical to extend these capabilities to audio and music, real-time video games, and 3D VR experiences. Soon, advanced AI may do most of the creative heavy lifting with just a few suggestions. Imagine unlimited entertainment generated in real-time, on demand. ‘I expect it to be fully multi-modal,’ said [creator Emad] Mostaque, ‘so you can create anything you can imagine, like the Star Trek holodeck experience.’ […]
Realistic image synthesis models are potentially dangerous for reasons already mentioned, such as the creation of propaganda or misinformation, tampering with history, accelerating political division, enabling character attacks and impersonation, and destroying the legal value of photo or video evidence. In the AI-powered future, how will we know if any remotely produced piece of media came from an actual camera, or if we are actually communicating with a real human?”
–Matthew]
With Koe Recast, you can change your voice as easily as your clothing
New AI tool alters the style and timbre of your voice, concealing your vocal identity.
By Benj Edwards
September 8, 2022
Thanks to a web demo of a new AI tool called Koe Recast, you can transform up to 20 seconds of your voice into different styles, including an anime character, a deep male narrator, an ASMR whisper, and more. It’s an eye-opening preview of a potential commercial product currently undergoing private alpha testing.
Koe Recast emerged recently from a Texas-based developer named Asara Near, who is working independently to develop a desktop app with the aim of allowing people to change their voices in real time through other apps like Zoom and Discord. “My goal is to help people express themselves in any way that makes them happier,” said Near in a brief interview with Ars.
Several demos on the Koe website show altered clips of Mark Zuckerberg talking about augmented reality with a female voice, a deep male narrator voice, and a high-pitched anime voice, all powered by Recast.
This kind of realistic AI-powered voice transformation technology isn’t new. Google made waves with similar tech in 2018, and audio deepfakes of celebrities have caused controversy for several years now. But seeing this capability in an independent startup funded by one person—“I’ve funded this project entirely by myself thus far,” Near said—shows how far AI vocal synthesis tech has come and perhaps hints at how close voice transformation might be to widespread adoption through a low-cost or open source release.
When asked what specific kind of AI powers Recast’s voice transformation under the hood, Near held back the details but described the approach in general terms: “We’re able to dive in and alter the characteristics of voices within the embedding space that we’ve created. Our goal, then, is to modify the parts of audio that correspond to a speaker’s personal style or timbre while preserving the parts of the audio that correspond to the spoken content such as prosody and words. This allows us to change the style of someone’s voice to any other style, including their perceived gender, age, ethnicity, and so on.”
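Near’s description resembles the general recipe used in published voice-conversion research (for example, autoencoder-style systems such as AutoVC): encode what was said and who said it into separate embeddings, then decode the content with a different speaker embedding to get the same words in a new voice. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; it is not Koe Recast’s code, and every class and function name here is invented for the example.

# Hypothetical sketch of embedding-space voice conversion (not Koe Recast's code).
# Idea: encode an utterance into a content embedding, encode the speaker's
# identity into a separate style/timbre embedding, then decode the content
# together with a *different* style embedding.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Maps a mel-spectrogram (batch, time, n_mels) to content features."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True, bidirectional=True)

    def forward(self, mel):
        out, _ = self.rnn(mel)           # (batch, time, 2*dim)
        return out

class SpeakerEncoder(nn.Module):
    """Collapses an utterance into a fixed-size style/timbre embedding."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, mel):
        _, h = self.rnn(mel)             # (1, batch, dim)
        return h.squeeze(0)              # (batch, dim)

class Decoder(nn.Module):
    """Reconstructs a mel-spectrogram from content features plus a style embedding."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.rnn = nn.GRU(2 * dim + dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, n_mels)

    def forward(self, content, style):
        # Broadcast the fixed-size style vector across every time step.
        style = style.unsqueeze(1).expand(-1, content.size(1), -1)
        out, _ = self.rnn(torch.cat([content, style], dim=-1))
        return self.proj(out)            # predicted mel-spectrogram

def convert(content_enc, speaker_enc, decoder, source_mel, target_mel):
    """Take the words from the source utterance and the voice from the target."""
    content = content_enc(source_mel)          # what was said
    target_style = speaker_enc(target_mel)     # how the target speaker sounds
    return decoder(content, target_style)      # same words, new voice

if __name__ == "__main__":
    # Smoke test with random tensors standing in for real mel-spectrograms.
    ce, se, de = ContentEncoder(), SpeakerEncoder(), Decoder()
    src = torch.randn(1, 120, 80)   # source speech frames
    tgt = torch.randn(1, 200, 80)   # reference utterance from the target voice
    print(convert(ce, se, de, src, tgt).shape)   # torch.Size([1, 120, 80])

Research systems in this vein are typically trained to reconstruct an utterance from its own content and speaker embeddings, with bottlenecks or other constraints that force the two representations to stay disentangled; whether Recast works this way is an assumption, since Near did not confirm the architecture.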
Recast supports 10 different voices, and more are on the way. “It’s currently undecided if we will be offering existing voices of celebrities or other well-known persons,” said Near.
Offering celebrity voices (or those imitating non-celebrity living persons) may pose ethical and legal questions, however. When asked about the potential misuse of Recast, Near replied, “As with any technology, it’s possible for there to be both positives and negatives, but I think the vast majority of humanity consists of wonderful people and will benefit greatly from this.” Near also pointed out that Recast includes a Terms of Service policy prohibiting illegal and hateful usage.
As for a release timeline, Near is pursuing commercial options but isn’t ruling out an open source release, which could potentially have an impact similar to Stable Diffusion by putting realistic audio deepfakes into the hands of many without hard restrictions. “We’re exploring some monetization strategies,” Near said. “If the profit models I have in mind don’t work out, open-sourcing this technology may be an option in the future.”
As deep learning technology continues to peel away the 20th-century concept (or, some might say, “illusion”) of media as a fixed and accurate record of reality, we are looking at a near future in which digital representations of a living human’s voice, much like images and video, will be one more thing you can’t take at face value without significant trust in the source. Still, the technology could empower many people who might otherwise be discriminated against while doing business—or simply having fun—online.