
Converting several audio streams into one voice makes it easier for AI to learn

IBM researchers showed that putting the words of multiple speakers into one voice helps AI models pick up the nuances of spoken language. Their algorithm builds on a popular foundation model for speech processing, improving its accuracy and continuing IBM’s efforts to make AI accessible to everyone.



If you’ve ever asked a voice assistant a question and marveled at how badly it mangled what you just said, you’ve encountered the limits of AI speech recognition. The problem has to do with how people speak. While written language is standardized and predictable, the way we express ourselves verbally is highly irregular. It’s this individual variation — in accent, rhythm, and intonation — that gives speech-processing bots so much trouble.

“We each have a unique style of speaking,” said Yang Zhang, a researcher at the MIT-IBM Watson AI Lab working on AI speech processing. “Most of the time we understand each other just fine, but speech recognition systems get confused when they hear the same word spoken in slightly different ways. If you can deliver those words in one voice, we show the system can learn to recognize those subtle variations as one word, and not several.”

In a study presented at this year’s International Conference on Machine Learning (ICML), Zhang and his colleagues showed that adding a voice-conversion algorithm to the cutting-edge speech-processing model HuBERT improved the model’s performance on a range of key tasks, from identifying a speaker’s intent to flagging gibberish in samples of recorded speech.

After years of halting progress, speech recognition is having its moment. The shift came two years ago with the release of HuBERT, Facebook’s pre-trained model for analyzing speech, which was itself patterned after Google’s text-analysis model, BERT.


BERT was the first language model to take a firehose of raw, unlabeled text and figure out its hidden structures with no explicit instruction. The model could then be trained on a small set of labeled examples to learn a downstream task, saving time, money, and energy.

Both BERT and HuBERT play a fill-in-the-blank guessing game with the words in a sentence to understand how they relate to each other. That holistic view of language allows the models to extract more context as they churn through reams of raw text or hours of recorded speech.
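To make that guessing game concrete, here is a minimal sketch of masked prediction, the training trick both models rely on: hide a fraction of the input and score the network only on its guesses for the hidden positions. The tiny network, mask rate, and vocabulary size below are illustrative placeholders rather than the actual BERT or HuBERT architectures (HuBERT, in particular, predicts discrete labels derived from clustered audio features rather than text tokens).

```python
# Minimal sketch of masked prediction, the self-supervised game behind
# BERT and HuBERT. Illustrative only: a real model uses a transformer
# encoder, and HuBERT's targets come from clustered audio features.
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 64
MASK_ID = 0  # hypothetical id reserved for a [MASK] token

model = nn.Sequential(              # stand-in for a transformer encoder
    nn.Embedding(vocab_size, hidden),
    nn.Linear(hidden, hidden),
    nn.ReLU(),
    nn.Linear(hidden, vocab_size),  # guess the identity of every position
)

tokens = torch.randint(1, vocab_size, (8, 20))  # a batch of token sequences
mask = torch.rand(tokens.shape) < 0.15          # hide roughly 15% of positions
corrupted = tokens.masked_fill(mask, MASK_ID)   # replace hidden tokens with [MASK]

logits = model(corrupted)                       # predictions for every position
loss = nn.functional.cross_entropy(             # scored only where tokens were hidden
    logits[mask], tokens[mask]
)
loss.backward()
```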

Once they have a grammar-like foundation, pre-trained foundation models like BERT and HuBERT need relatively few labeled examples to learn new language skills. Previously, it might have taken a few hundred hours to train a language model to summarize something like a collection of legal briefs or identify keywords in customer voice recordings, according to Zhang. Today, they can do it in 10 minutes.

“And they perform just as well,” he added.

Still, speech processing poses challenges that written text does not. Disentangling the identities of multiple speakers comes naturally to humans, but machines struggle without a transcript of the conversation to match spoken words against their written equivalents and clear up ambiguities.

Zhang is an expert on voice conversion, or transferring one person’s speaking style to another. He wondered if dubbing one voice over the others would simplify matters. The words, or content, would stay the same, and so would each speaker’s distinctive speaking style. But if the model heard everyone’s lines as though a single person were speaking them, could it learn to recognize the acoustic variations of a word as the same word?

To test the idea, Zhang and his colleagues combined the HuBERT foundation model with AutoVC, an updated voice-conversion model that Zhang’s team released two years ago. They named the expanded foundation model for speech recognition ContentVec.
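In outline, the data flow the team describes looks something like the sketch below: run every utterance through a voice-conversion model so it sounds like a single reference speaker, then pretrain on the converted audio as usual. The function names here (load_utterances, convert_voice, pretrain_masked_prediction) are hypothetical stand-ins for the real components, not the authors’ code.

```python
# Rough sketch of the ContentVec-style data flow described above:
# convert every utterance into one "teacher" voice so the model sees the
# same content stripped of speaker identity, then pretrain as usual.
# All three functions are hypothetical stubs, not the authors' API.

def load_utterances():
    """Yield (speaker_id, waveform) pairs from a speech corpus (stub)."""
    return [("spk1", [0.0] * 16000), ("spk2", [0.0] * 16000)]

def convert_voice(waveform, target_speaker):
    """Voice conversion (an AutoVC-style model in the paper): keep the
    words, swap in the target speaker's vocal characteristics (stub)."""
    return waveform

def pretrain_masked_prediction(utterances):
    """HuBERT-style self-supervised pretraining on the converted audio (stub)."""
    pass

TEACHER = "spk1"  # the single voice chosen to speak everyone's lines

converted = [convert_voice(wav, TEACHER) for _, wav in load_utterances()]
pretrain_masked_prediction(converted)
```

The point of the conversion step is that the pretraining stage downstream never has to untangle who is speaking; it only has to model what is being said.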

Finding a voice to speak everyone’s lines was no easy task. It had to be a voice that could be easily cloned and adapted to the speaking styles of the others. The team started with a dataset of 200 men and women reading audiobooks aloud and, to minimize computational costs, narrowed the field to 20 speakers.

From there, they dubbed the voice of each speaker over all the others, preserving the unique speaking style of each. In the end, they picked the voice whose converted copies paid listeners on Amazon Mechanical Turk rated as nearly indistinguishable from the original. They integrated that voice into ContentVec.
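The selection step can be pictured as a small loop: convert the other candidates into each target voice, collect listener similarity ratings, and keep the voice whose conversions score closest to the real thing. The rating function below is a random placeholder standing in for the Mechanical Turk judgments, and the candidate names are made up.

```python
# Hedged sketch of the teacher-voice selection described above: score how
# convincingly each candidate voice can "speak" the others' lines, then keep
# the best-scoring candidate. listener_similarity() is a placeholder for the
# crowd-sourced ratings, not a real metric.
import random

candidates = [f"speaker_{i:02d}" for i in range(20)]

def listener_similarity(source, target):
    """Placeholder score in [0, 1]: how close the conversion of `source`
    into `target`'s voice sounds to a real recording of `target`."""
    return random.random()

def average_score(target):
    others = [s for s in candidates if s != target]
    return sum(listener_similarity(s, target) for s in others) / len(others)

teacher = max(candidates, key=average_score)
print("Chosen teacher voice:", teacher)
```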

When the researchers ran ContentVec on a range of speech-recognition tasks, they found that it outperformed the HuBERT baseline by about 10 percent, a sizeable improvement in accuracy.

The researchers next plan to train the model on recorded dialogue and pairs of questions and answers, using fewer labeled examples and less computation. The larger goal is to translate thousands of endangered languages worldwide into machine-readable form, both to preserve them and to train AI language models that can make time-saving applications available to more people.

Many threatened languages have no written form, making them especially difficult for AI models to crack. It’s a problem with special urgency for Zhang, who grew up in Guangzhou, a port city in southern China ringed by skyscrapers.

Zhang is fluent in Mandarin and Cantonese, but never learned to say more than a few phrases in the Zhongshan dialect his mother grew up speaking in a village two hours away. Time to learn it is running out.

“More and more people have moved out of the village, and the younger generation has stopped speaking this dialect,” he said. “Today, there are fewer than 100 speakers left, and the neighboring villages are in a similar situation.”

If foundation models for spoken language continue to advance, it could be possible to save the Zhongshan dialect and others like it for future generations.

“It would make my grandma very happy,” he said.