Realistic voice illusions: Two startups create and alter voices by using neural networks

[Two recent stories highlight efforts to create more effective presence illusions based on the human voice; the first is from GeekWire, where the original version includes three more videos that compare the WellSaid technology to readings by voice actors. The second, a Digital Trends story about Modulate’s “voice skins,” follows below and also includes a demonstration video. –Matthew]

[Image: A screenshot illustrates how WellSaid’s voice synthesis platform could be used. Credit: WellSaid Illustration]

AI2 gives birth to WellSaid, a startup that synthesizes amazingly realistic voices

By Alan Boyle
March 7, 2019

We’ve got Apple’s Siri, Microsoft’s Cortana, Amazon’s Alexa and Google Assistant — so do we really need more synthesized voices to do our bidding?

Absolutely, say the founders of WellSaid Labs, a startup that’s being spun out from Seattle’s Allen Institute for Artificial Intelligence (also known as AI2).

“We’re just solving a different problem,” co-founder and chief technology officer Michael Petrochuk told GeekWire. “Alexa and Google Home are trying to solve the problem of clearly, slowly communicating — pronouncing everything the same way, in a monotone format so it could be understood by everyone.”

WellSaid, in contrast, is developing a stable of AI-powered voices customized for different contexts and sounding so lifelike that you wouldn’t believe they’re robots. During a recent video demonstration for a roomful of AI aficionados, most folks guessed that the images were generated by an algorithm, but not the voices: [watch the 1:03 minute video in the original story or via YouTube]

“Our voices sound different each time,” Petrochuk said. “They always interpret the sentence differently, and they can be used in a video, or an audiobook, without making you fall asleep.”

The venture grew from the work that was being done by Petrochuk and WellSaid’s other founder and CEO, Matt Hocking, under the aegis of AI2’s startup incubator. Now the technology is ready for its public reveal, and the two AI researchers are raising seed funding and seeking partners.

“We’re looking to partner with people who are looking to sell content production with voice, and also the next generation of voice experiences,” Hocking said. “We’re actively looking for people to explore opportunities.”

The technology could be applied to a wide range of uses: the video game Red Dead Redemption 2, for example, required the services of 700 voice actors. Theoretically, WellSaid could offer a huge catalog of synthesized voices to do the same job with AI.

WellSaid’s software platform could also spice up audiobooks, offer customized voice assistants or give companies “branded voices” that become part of their enduring image. Veteran announcer Don Pardo may no longer be with us, but his synthesized voice could continue to introduce “Saturday Night Live” for decades to come.

For those who have lost their ability to speak due to accident or illness, WellSaid could provide a synthesized voice with a natural lilt rather than the robotic monotone that became the trademark of the late physicist Stephen Hawking.

Hocking compared the concept to the use of stock images, stock video and stock music in creative productions. Now there’ll be stock voices.

“Anything which is written can now be voiced,” Hocking said.

Petrochuk and Hocking are very aware of the potential pitfalls associated with super-realistic synthetic voices. Deepfake videos — such as a viral clip in which former President Barack Obama appears to make crazy statements like “Ben Carson is in the sunken place” — already show how the line between reality and fakery can be blurred beyond recognition. [watch the 1:12 minute video in the original story or via YouTube]

“That’s just not a direction that our company wants to head in,” Petrochuk said. “Our focus is on allowing creators to create with voice, and we’re focusing on building a product for the common good, per AI2’s mission. With that, we have to recognize some possible negative implications of this technology.”

Petrochuk said WellSaid won’t allow anyone to create a voice. “All we’re doing is, we’re opening up a library of curated voices, with the appropriate cautions to make sure those voices aren’t used in a negative light,” he said.

WellSaid’s voices are generated by recording consenting voice actors as they read text, then running the recordings through an algorithm that captures each voice’s natural-sounding “fingerprint.” That voice can then be used to speak any text entered into WellSaid’s software, with appropriate tweaks to convey emotional content.
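The two-stage workflow described here — capture an actor’s vocal “fingerprint” once, then voice arbitrary new text with it — can be sketched roughly as follows. WellSaid has not published its model, so both functions below are hypothetical stand-ins: a real system would use a neural speaker encoder and a neural synthesizer, not the simple spectral math shown here.

```python
import numpy as np

def extract_fingerprint(recordings):
    """Stand-in for a neural speaker encoder: reduce a voice actor's
    consented recordings to one fixed-length 'fingerprint' vector."""
    embeddings = []
    for audio in recordings:
        # Crude spectral summary of each take, normalized to unit length.
        spectrum = np.abs(np.fft.rfft(audio, n=256))
        embeddings.append(spectrum / (np.linalg.norm(spectrum) + 1e-9))
    return np.mean(embeddings, axis=0)

def synthesize(text, fingerprint):
    """Stand-in for synthesis: a real system would condition a neural
    vocoder on both the input text and the speaker fingerprint."""
    seed = sum(text.encode())                  # deterministic per text
    rng = np.random.default_rng(seed)
    # Shape noise by the fingerprint so output depends on the speaker.
    return rng.standard_normal(fingerprint.shape) * fingerprint

# Usage: fingerprint the actor once, then voice any text.
actor_takes = [np.sin(np.linspace(0, 100, 16000)) for _ in range(3)]
fp = extract_fingerprint(actor_takes)
audio = synthesize("Anything which is written can now be voiced.", fp)
```

The key property of the design, reflected even in this toy: the expensive step (recording the actor) happens once, and every subsequent utterance reuses the stored fingerprint.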

Won’t WellSaid’s stable of synthesized voices put actors out of business?

“At the moment, we’re working on the core technology, but we definitely do see a business model where you can look at a voice actor and liken it to a photographer,” Hocking said. “A voice actor could potentially have a synthetic version of their voice which they may be able to license out for larger-volume, lower-quality projects — but then do work on the high-end movie or television commercial that truly needs to be acted.”

The flip side is that the software can literally give voice to the voiceless.

“The positives far outweigh the negatives,” Hocking said. “You look at CGI, you look at existing technology, and it’s inevitable that voice is going to be a part of that. The applications that we’re focused on, and the way they’ll empower people who have trouble speaking, or can’t speak, or need access to voice in order to produce something valuable, is what we’re focused on. … We’re focused on bringing this amazing technology to the people who need it most.”

[From Digital Trends, where the original version includes a 0:45 minute demonstration video (also available via YouTube)]

[Image: Modulate founders Carter Huffman and Michael Pappas]

Modulate wants to bring voice skins to your favorite online games

New technology called ‘voice skins’ could let you change how you sound in games

Gabe Gurwin
February 27, 2019

Over the last decade, online multiplayer games have turned into more personal experiences. Call of Duty: Black Ops 4, Fortnite, Anthem and more are offering new ways for players to enhance their characters. From outfits to weapon skins and even emotes, players can adorn their avatars in a way that reflects their taste and personal style.

Missing from these games, however, is the ability to change your voice. Modulate, a software company co-founded by Mike Pappas and Carter Huffman, aims to address this with a technology called “voice skins,” which allows you to change your voice on the fly.

Using deep neural networks and machine learning, Modulate allows you to customize your voice. You can choose to sound like the opposite gender or a celebrity, or even create your own custom voice. Your emotion and cadence remain the same, with Modulate giving you full control over how your voice comes across.

CTO Carter Huffman says he became interested in the potential for voice-skin technology around 2015, after trying photo-editing apps such as Prisma, which can drastically restyle existing photos to look like famous works of art.

Huffman realized that this kind of technology could find a home in audio. It took about a year of experimentation to get results; he eventually found that adversarial networks made the process easier.

“This is something that people have wanted for 100 years,” CEO Pappas said. “It has shown up in sci-fi, in games, and in stories all over the place – as something that, obviously, we should be developing.”

Modulate works by having one neural network listen to a user’s voice and attempt to produce the target voice, while a second, adversarial neural network examines the output and determines whether it sounds authentic. For example, to make the Barack Obama voice skin sound like him, the adversarial network was given clips of his speeches so it could better understand his voice.

The process is iterative, with the adversarial network identifying specific parts of the voice skin’s audio that don’t sound correct. If a voice is the wrong pitch, for instance, this will be corrected, and the voice skin network will not make this mistake on its next try.

“Eventually, it outputs speech that the adversary cannot tell the difference between the voice skin’s output and real Barack Obama. And if the adversary is really good, then we also cannot tell the difference,” Huffman added.
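The generator-versus-adversary loop Huffman describes is the standard adversarial-training pattern. The toy sketch below shrinks “voice” down to a single pitch number so the feedback cycle is visible; the target value, scoring function, and update rule are all invented for illustration — a real system trains two neural networks on audio features, not scalars.

```python
# Toy adversarial loop: the "voice skin" adjusts its output until the
# "adversary" can no longer flag a difference from the target speaker.
TARGET_PITCH = 180.0   # hypothetical trait of the target voice
gen_pitch = 100.0      # voice-skin network's initial (wrong) output

def adversary_score(pitch):
    """Toy adversary: distance from the target trait. A score near 0
    means it can't tell the skin's output from the real speaker."""
    return abs(pitch - TARGET_PITCH)

for step in range(200):
    error = adversary_score(gen_pitch)
    if error < 1e-3:   # adversary fooled; training has converged
        break
    # The adversary's feedback identifies what sounds wrong (here, pitch);
    # the generator corrects a fraction of that error on its next try.
    gen_pitch += 0.2 * (TARGET_PITCH - gen_pitch)
```

This mirrors the iterative refinement described above: each round, the adversary points out a specific flaw, and the voice-skin side doesn’t repeat that mistake on the next attempt.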

The goal, however, is not to let you impersonate another person. Modulate embeds a digital watermark that computer programs can detect, alerting them when someone is using a voice skin. The plan is for the technology to be implemented directly into other programs rather than used on its own. This should prevent voice fraud during phone calls, and you would not be able to impersonate well-known voice actors to make a demo reel of your own.
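The article doesn’t say how Modulate’s watermark actually works, so the sketch below is purely illustrative of the general idea: hide a faint, near-ultrasonic marker tone in the synthesized audio, and have detectors look for energy at that frequency. A production watermark would need to be far more robust — surviving compression, resampling, and deliberate removal.

```python
import numpy as np

MARK_FREQ = 19_000   # Hz; assumed near-ultrasonic marker frequency
RATE = 44_100        # samples per second

def add_watermark(audio):
    """Mix a faint high-frequency tone into synthesized speech.
    (Illustrative only; real audio watermarks are more sophisticated.)"""
    t = np.arange(len(audio)) / RATE
    return audio + 0.001 * np.sin(2 * np.pi * MARK_FREQ * t)

def detect_watermark(audio):
    """Flag voice-skin audio by checking the spectrum for an anomalous
    energy spike at the marker frequency."""
    spectrum = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), d=1 / RATE)
    band = spectrum[np.abs(freqs - MARK_FREQ) < 50]
    return band.max() > 10 * np.median(spectrum)
```

A game or phone platform embedding the detector could then label (or reject) calls and chat audio that carry the mark, which is the enforcement mechanism the article alludes to.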

If Modulate were used in a large game like Fortnite, it would likely be built natively into the application. Pappas also noted that certain companies develop the voice chat systems for multiple games, and Modulate could work with them to implement the technology across several supported titles. The technology would allow games to flag which users are running Modulate, but the company will ultimately leave it to each game’s developers to decide whether to make use of it.

Pappas and Huffman want Modulate to be used for players to better express themselves in their favorite games. If the skin you happen to be wearing is of something menacing or makes your avatar look intimidating, Modulate could more easily portray this. Likewise, for those self-conscious about their own voices, the technology would allow them to communicate with others more comfortably.

In the immediate future, Modulate plans to continue its pilot program, which integrates and tests the technology in existing chat platforms and games. As the company grows, it aims to add features such as accent changes. Pappas believes the technology could have applications beyond video game chat: since it affects tonality rather than the words themselves, it should apply readily across multiple languages as well.

“We’re starting in the gaming space, but we really see this as a fundamentally required technology in order for you to use voice chat, and everyone’s going to use voice chat,” he said.

Huffman noted that with virtual reality technology becoming more lifelike, voice skins could make the experience even more immersive. As of now, your options are limited.

“It’s the Ready Player One dream, right?” Huffman said. “You’re inhabiting this character, and then you speak, and it’s just your voice, or maybe a Darth Vader voice. But you can’t convincingly be the rest of that character that you want to be.”

Modulate would certainly give game developers more options for in-game goodies. Alongside the latest costume, games could offer voice skins as rewards for high-level play. The possibilities are nearly endless, and we’ll likely see the first fruits of the team’s labor later this year. The plan is for Modulate to be integrated into existing games by the end of 2019, and possibly within the next six months.

If you’d like to hear Modulate’s technology in real time, you can try a demonstration on the company’s website. Multiple sliders let you make fine adjustments to your recording, and the results are both impressive and hilarious. As more users try out Modulate and it learns more about sounds (such as laughter), the neural networks should produce an ever more polished and believable result.


ISPR Presence News