Computer scientists and electrical engineers from the University of California San Diego have discovered a way to produce more expressive AI-generated voices with minimal training.
Researchers have developed a method to make AI-generated voices, such as those of digital personal assistants, more expressive and organic with a minimal amount of training. The method, which converts text to speech, can also be applied to voices that were never part of the system’s training set.
The team of computer scientists and electrical engineers from the University of California San Diego presented their work at the ACML 2021 conference.
Possible applications of this new technology
In addition to personal assistants for smartphones, homes and cars, the method could help improve voice-overs in animated movies and the automatic translation of speech into multiple languages. It could also help establish personalised speech interfaces that empower individuals who have lost the ability to speak, similar to the computerised voice that Stephen Hawking used to communicate, but more organic and expressive.
“We have been working in this area for a fairly long period of time,” said Shehzeen Hussain, a PhD student at the UC San Diego Jacobs School of Engineering and one of the paper’s lead authors. “We wanted to look at the challenge of not just synthesising speech but of adding expressive meaning to that speech.”
Shortcomings of existing AI-generated voice methods
Existing AI-generated voice methods fall short of this work in one of two ways: some systems can synthesise expressive speech for a specific speaker, but only by utilising several hours of training data for that speaker; others can synthesise speech from only a few minutes of speech data from a speaker never encountered before, but they cannot generate expressive speech and only translate text to speech.
By contrast, the method developed by the UC San Diego team is the only one that can generate expressive speech, with minimal training, for a speaker who was not part of its training set.
A new method for adding expressive meaning to synthesised speech
The researchers identified the pitch and rhythm of speech in the training samples as proxies for emotion. This allowed their cloning system to generate expressive speech with minimal training, even for voices it had never encountered before.
“We demonstrate that our proposed model can make a new voice express, emote, sing or copy the style of a given reference speech,” the researchers explained.
Their method can synthesise speech directly from text; reconstruct a speech sample from a target speaker; and transfer the pitch and rhythm of speech from a different, expressive speaker into cloned speech intended for the target speaker.
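To give a sense of what "pitch as a proxy for emotion" means in practice, the sketch below estimates the fundamental frequency (pitch) of an audio signal with a simple autocorrelation search. This is purely illustrative and is not the authors' implementation; production systems typically use more robust pitch extractors such as YIN or CREPE.

```python
import math

def estimate_pitch(samples, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of a signal via a
    brute-force autocorrelation search over plausible pitch periods."""
    min_lag = int(sample_rate / fmax)   # shortest period to test
    max_lag = int(sample_rate / fmin)   # longest period to test
    best_lag, best_corr = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a lagged copy of itself;
        # the correlation peaks when lag matches the pitch period.
        corr = sum(samples[i] * samples[i - lag]
                   for i in range(lag, len(samples)))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sample_rate / best_lag if best_lag else 0.0

# Synthetic example: a 220 Hz sine tone sampled at 8 kHz
sr = 8000
signal = [math.sin(2 * math.pi * 220.0 * t / sr) for t in range(2000)]
print(f"estimated pitch: {estimate_pitch(signal, sr):.1f} Hz")
```

A per-frame contour of such pitch estimates, together with timing information such as phoneme durations (the "rhythm"), is the kind of conditioning signal that lets a synthesiser carry the intonation of a reference recording into a cloned voice.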
Possible threats of this new technology
The team is aware that their AI-generated voice work could be utilised to make deep-fake videos and audio clips even more realistic and persuasive. As a result, they plan to release their code with a watermark that will identify the speech as cloned if created by their method.
“Expressive voice cloning would become a threat if you could make natural intonations,” concluded Paarth Neekhara, the paper’s other lead author and a PhD student in computer science at the Jacobs School. “The more important challenge to address is detection of these media and we will be focusing on that next.”
The method itself still needs further development: it is biased toward English speakers and struggles with voices that have a strong accent.