‘Detoxified’ language models might marginalize minorities, says study

AI language models like GPT-3 have an aptitude for generating humanlike text. A key factor is the large datasets, scraped from the web, on which they’re trained. But because the datasets are often too large to filter with precision, they contain expletives, slurs, and other offensive and threatening speech. Language models unavoidably learn to generate toxic text when trained on this data.

To address this, research has pivoted toward “detoxifying” language models without degrading the quality of the text they generate. Existing strategies employ techniques like fine-tuning language models on nontoxic data and using “toxicity classifiers.” But while these strategies can be effective, a new study from researchers at the University of California, Berkeley and the University of Washington finds fault with some of the most common detoxification approaches. According to the coauthors, detoxification strategies risk marginalizing minority voices.

Natural language models are the building blocks of apps including machine translators, text summarizers, chatbots, and writing assistants. But there’s growing evidence that these models risk reinforcing undesirable stereotypes, mostly because a portion of their training data is commonly sourced from communities with gender, race, and religious prejudices. Detoxification has been proposed as a solution to this problem, but the coauthors of this latest research, as well as researchers at the Allen Institute, found that the technique can amplify rather than mitigate biases.

In their study, the UC Berkeley and University of Washington researchers evaluated “detoxified” language models on text with “minority identity mentions” including words like “gay” and “Muslim,” as well as surface markers of African-American English (AAE). AAE, also known as Black English in American linguistics, refers to the speech distinctive to many Black people in the U.S. and Canada.

The researchers, who used GPT-2, the predecessor to GPT-3, as a test model, showed that three different kinds of detoxification methods caused a disproportionate increase in language model perplexity on text containing African-American English and minority identity mentions. In machine learning, perplexity measures how well a model predicts a sample of text; lower is generally better. Training on a curated version of the English Jigsaw Civil Comments dataset, released by Jigsaw, the Alphabet-owned unit that builds tools to counter online toxicity, the researchers found that perplexity increased by a factor of 2.1 on nontoxic “white-aligned English” data and by a factor of 4.3 on minority identity mention data. Increasing the strength of the detoxification worsened the bias.
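
For readers unfamiliar with the metric, here is a minimal sketch of how perplexity is typically computed for GPT-2 with the Hugging Face transformers library. The checkpoint and example sentence are illustrative only and are not the study’s evaluation setup.

```python
# Minimal sketch: computing GPT-2 perplexity on a sentence with
# Hugging Face transformers (illustrative, not the study's code).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(average per-token cross-entropy loss)."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model return its mean language-modeling loss.
        loss = model(input_ids, labels=input_ids).loss
    return torch.exp(loss).item()

# Comparing scores across dialects or identity mentions would surface the
# kind of gap the researchers measured (higher perplexity = worse fit).
print(perplexity("The weather is nice today."))
```
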

Why might this happen? The coauthors speculate that toxicity datasets like English Jigsaw Civil Comments contain spurious correlations between the presence of AAE and minority identity mentions and “toxic” labels — the labels from which the language models learn. These correlations cause detoxification techniques to steer models away from AAE and minority identity mentions because the models wrongly learn to consider these aspects of language to be toxic.
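
To make the mechanism concrete, here is a toy illustration of a spurious correlation, not the researchers’ code or data. The example comments, labels, the “finna” stand-in for an AAE marker, and the use of scikit-learn are all invented for demonstration.

```python
# Toy illustration: a bag-of-words toxicity classifier absorbs a spurious
# correlation when a dialect marker co-occurs with "toxic" labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled comments; the marker appears mostly in toxic examples.
texts = [
    "finna block you, you idiot",      # toxic, contains marker
    "you finna regret this, loser",    # toxic, contains marker
    "shut up, nobody asked",           # toxic, no marker
    "finna grab some food later",      # nontoxic, contains marker
    "have a great day everyone",       # nontoxic, no marker
    "see you at the game tonight",     # nontoxic, no marker
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = toxic

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
classifier = LogisticRegression().fit(X, labels)

# Because toxic examples over-represent the marker, its learned weight is
# positive, so a harmless sentence containing it scores as more "toxic".
marker_weight = classifier.coef_[0][vectorizer.vocabulary_["finna"]]
print("learned weight for 'finna':", marker_weight)
print(classifier.predict_proba(vectorizer.transform(["finna see you later"]))[:, 1])
```

When a signal like this is used to steer or filter a language model, the model is pushed away from the dialect itself, not just from genuinely toxic content.
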

As the researchers note, the study’s results suggest that detoxified language models deployed in production might struggle to understand aspects of minority languages and dialects. This could force users to code-switch to white-aligned English to get the models to work for them, and it could discourage minority speakers from engaging with the models at all. Moreover, because detoxified models tend to avoid topics that mention minority identity terms, such as religions including Islam, they could contribute to ostracization and stifle informed discussion of identity. For example, tailoring a language model to white-aligned English could stigmatize AAE as incorrect or “bad” English.

In the absence of ways to train accurate models in the presence of biased data, the researchers propose improving toxicity datasets as a potential way forward. “Language models must be both safe and equitable to be responsibly deployed in practice. Unfortunately, state-of-the-art debiasing methods are still far from perfect,” they wrote in the paper. “We plan to explore new methods for debiasing both datasets and models in future work.”

The increasing attention to language bias comes as some within the AI community call for greater consideration of the role of social hierarchies like racism. In a paper published last June, Microsoft researchers advocated for the field to examine more closely the relationships between language, power, and prejudice in its work. The paper also concluded that the research field generally lacks clear descriptions of bias and fails to explain how, why, and to whom that bias is harmful.
