
AI Weekly: These researchers are improving AI’s ability to understand different accents

Amazon's Echo smart speaker with Alexa.
Image Credit: Amazon

The pandemic appears to have supercharged voice app usage, which was already on an upswing. According to a study by NPR and Edison Research, the percentage of voice-enabled device owners who use commands at least once a day rose between the beginning of 2020 and the start of April. Just over a third of smart speaker owners say they listen to more music, entertainment, and news from their devices than they did before, and owners report requesting an average of 10.8 tasks per week from their assistants this year, compared with 9.4 in 2019. According to a new report from Juniper Research, consumers will interact with voice assistants on 8.4 billion devices by 2024.

But despite their growing popularity, assistants like Alexa, Google Assistant, and Siri still struggle to understand diverse regional accents. According to a study by the Life Science Centre, 79% of people with accents alter their voice to make sure that they’re understood by their digital assistants. And in a recent survey commissioned by the Washington Post, popular smart speakers made by Google and Amazon were 30% less likely to understand non-American accents than those of native-born users.

Traditional approaches to narrowing the accent gap would require collecting and labeling large datasets in each language, a time- and resource-intensive process. That’s why researchers at MLCommons, the nonprofit consortium behind MLPerf, an industry-standard set of benchmarks for machine learning performance, are embarking on a project called 1000 Words in 1000 Languages. It’ll involve creating a freely available pipeline that can take any recorded speech and automatically generate clips to train compact speech recognition models.

“In the context of consumer electronic devices, for instance, you don’t want to have to go out and build new language datasets because that’s costly, tedious, and error-prone,” Vijay Janapa Reddi, an associate professor at Harvard and a contributor on the project, told VentureBeat in a phone interview. “What we’re developing is a modular pipeline where you’ll be able to plug in different sources of speech and then specify the [words] for training that you want.”


While the pipeline will be limited in scope, in that it’ll only create training datasets for small, low-power models that continually listen for specific keywords (e.g., “OK Google” or “Alexa”), it could represent a significant step toward truly accent-agnostic speech recognition systems. Conventionally, training a new keyword-spotting model requires manually collecting thousands of labeled audio clips for each keyword. When the pipeline is released, developers will be able to simply provide a list of keywords they wish to detect along with a speech recording, and the pipeline will automate the extraction, training, and validation of models without requiring any manual labeling.
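To make that workflow concrete, here is a minimal, hypothetical sketch of the extraction step such a pipeline automates: given a long recording plus a word-level alignment (which in practice would come from an ASR system or forced aligner, not shown here), it slices out clips for a user-supplied keyword list into per-keyword folders, the layout most keyword-spotting training scripts expect. This is not MLCommons’ actual code; the alignment format and function below are assumptions made for illustration only.

```python
"""Hypothetical keyword-clip extraction sketch (not the MLCommons pipeline)."""
import wave
from pathlib import Path

def extract_keyword_clips(wav_path, alignment, keywords, out_dir):
    """Cut (word, start_s, end_s) spans out of wav_path into out_dir/<keyword>/.

    `alignment` is an assumed input: a list of (word, start_seconds, end_seconds)
    tuples produced upstream by an ASR system or forced aligner.
    """
    keywords = {k.lower() for k in keywords}
    out_dir = Path(out_dir)

    with wave.open(str(wav_path), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())
        bytes_per_frame = src.getsampwidth() * src.getnchannels()

    counts = {}
    for word, start_s, end_s in alignment:
        word = word.lower()
        if word not in keywords:
            continue
        counts[word] = counts.get(word, 0) + 1
        clip_dir = out_dir / word
        clip_dir.mkdir(parents=True, exist_ok=True)
        # Convert times to byte offsets aligned to whole audio frames.
        a = int(start_s * rate) * bytes_per_frame
        b = int(end_s * rate) * bytes_per_frame
        clip_path = clip_dir / f"{word}_{counts[word]:04d}.wav"
        with wave.open(str(clip_path), "wb") as dst:
            dst.setparams(params)
            dst.writeframes(frames[a:b])
    return counts

if __name__ == "__main__":
    # Tiny made-up alignment; a real one would cover hours of recorded speech.
    alignment = [("hello", 0.52, 0.98), ("alexa", 1.10, 1.66), ("play", 1.80, 2.05)]
    print(extract_keyword_clips("recording.wav", alignment, ["alexa"], "clips"))
```

A real pipeline would add validation splits and train a small model on the extracted folders; the point here is only that, once keyword occurrences can be located automatically, no one has to hand-label clips.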

“It’s not even really creating a dataset, it’s just training a dataset that comes about as a result of searching the larger corpus,” Reddi explained. “It’s like doing a Google search. What you’re trying to do is find a needle in a haystack — you end up with a subset of results with different accents and whatever else you have in there.”

The 1000 Words in 1000 Languages project builds on existing efforts to make speech recognition models more accessible — and equitable. Mozilla’s Common Voice, an open source, annotated speech dataset, consists of voice snippets along with voluntarily contributed metadata useful for training speech engines, such as speakers’ ages, sexes, and accents. As part of Common Voice, Mozilla maintains target segments of the dataset that aim to collect voice data for specific purposes and use cases, including the digits “zero” through “nine” as well as the words “yes,” “no,” “hey,” and “Firefox.” For its part, MLCommons in December released the first iteration of a public 86,000-hour speech dataset for AI researchers, with later versions due to branch into more languages and accents.
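As a rough illustration of how that contributed metadata can be put to work, the sketch below groups a local Common Voice download’s validated clips by their self-reported accent label, one way a small team might check or balance accent coverage before training. The file layout and column names (path, accent or accents) are assumptions and vary across Common Voice releases.

```python
"""Sketch: group Common Voice clips by self-reported accent (assumed TSV layout)."""
import csv
from collections import defaultdict

def clips_by_accent(tsv_path, per_accent_cap=500):
    """Return {accent_label: [clip paths]} from a Common Voice metadata TSV,
    capping each group so no single accent dominates the subset."""
    groups = defaultdict(list)
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # Older releases use an "accent" column, newer ones "accents".
            accent = (row.get("accent") or row.get("accents") or "").strip()
            if not accent:
                continue  # many contributors leave the accent field blank
            if len(groups[accent]) < per_accent_cap:
                groups[accent].append(row.get("path", ""))
    return groups

if __name__ == "__main__":
    subset = clips_by_accent("cv-corpus/en/validated.tsv")
    for accent, paths in sorted(subset.items(), key=lambda kv: -len(kv[1]))[:10]:
        print(f"{accent:30s} {len(paths)} clips")
```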

“The organizations that have a huge amount of speech are often large organizations, but speech is something that has many applications,” Reddi said. “The question is, how do you get this into the hands of small organizations that don’t have the same scale as big entities like Google and Microsoft? If they have a pipeline, they can just focus on what they’re building.”

For AI coverage, send news tips to Khari Johnson and Kyle Wiggers — and be sure to subscribe to the AI Weekly newsletter and bookmark our AI channel, The Machine.

Thanks for reading,

Kyle Wiggers

AI Staff Writer
