Google’s voice-generating AI is now indistinguishable from humans

[It’s sometimes hard to remember how quickly the technologies behind machines that we speak to and that speak to us have developed, and how many of us are interacting with machines this way (e.g., see a recent Pew Research report). The short story below from Quartz and the audio samples it includes and links to demonstrate how good the technologies are becoming at evoking presence. –Matthew]

[Image: A spectrogram for “whoa.” Credit: Lorenzo Tlacaelel]

By Dave Gershgorn
December 26, 2017

Humans have officially given their voice to machines.

A research paper published by Google this month, which has not been peer reviewed, details a text-to-speech system called Tacotron 2. The authors claim it achieves near-human accuracy at generating audio of a person speaking from text.

The system is Google’s second official generation of the technology, and it consists of two deep neural networks. The first network translates the text into a spectrogram, a visual way to represent audio frequencies over time. That spectrogram is then fed into WaveNet, a system from Alphabet’s AI research lab DeepMind, which reads the chart and generates the corresponding audio.
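To make the idea of a spectrogram concrete, here is a minimal sketch in plain Python that turns a list of audio samples into frames of frequency magnitudes. It is not Tacotron 2 or WaveNet, and it goes in the opposite direction from the first network (audio to spectrogram rather than text to spectrogram). The function names `spectrogram` and `sine`, the frame size, and the hop length are illustrative assumptions, not details from the paper.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames; each frame lists one magnitude per frequency bin.

    Illustrative only: a naive short-time Fourier transform, not the
    representation or code used by Tacotron 2.
    """
    # A Hann window tapers each frame to reduce leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        # Naive DFT, keeping only the non-negative frequency bins.
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


def sine(freq_hz, sample_rate, n_samples):
    """Generate a pure tone as a list of float samples (a stand-in for real audio)."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]


# Usage: a 1 kHz tone sampled at 8 kHz; each bin is sample_rate / frame_size Hz wide.
sample_rate = 8000
spec = spectrogram(sine(1000, sample_rate, 512))
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print("strongest frequency in first frame (Hz):", peak_bin * sample_rate / 64)
```

Each row of `spec` is one time slice and each column is one frequency band, which is the grid a spectrogram image draws. A system like the one described would predict such a grid from text and hand it to a second model to produce the waveform.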

You can listen to two samples [in the original story]. Keep in mind that for each sentence, one sample is generated by AI and the other is a recording of a human hired by Google. We don’t know for sure which is which. (However, if you reveal the “page source” and look at the filenames of each on the Google research website, one is labeled “gen,” ostensibly to mark the generated sample.)

The Google researchers also demonstrate that Tacotron 2 can handle hard-to-pronounce words and names, as well as alter the way it enunciates based on punctuation and capitalization. For instance, capitalized words are stressed, as a speaker would do to indicate that a specific word is an important part of a sentence.

[snip: Two more examples]

Unlike some core AI research the company does, this technology is immediately useful to Google. WaveNet, first announced in 2016, is now used to generate the voice in Google Assistant. Once readied for production, Tacotron 2 could be an even more powerful addition to the service.

However, the system is trained to mimic only one female voice; to speak like a male voice or a different female voice, Google would need to train the system again.

