Can Algorithms Measure Fairness Like They Measure Accuracy?

Credit: Pixabay/CC0 Public Domain

This year, Facebook has relaxed some of its rules regarding hate speech and abusive content. Combined with similar moves at X (formerly Twitter) after Elon Musk’s acquisition, these changes make it harder for users to avoid harmful speech online.

Still, social networks and online platforms continue to grapple with the enormous challenge of moderating content to keep users safe. One approach being explored is artificial intelligence (AI): tools that can sift through large volumes of content, easing the burden on human moderators, who often face distressing material.

Yet AI moderation has its own difficulties, particularly in ensuring fairness while maintaining accuracy. Maria De-Arteaga, an assistant professor of information management at Texas McCombs, points out that an algorithm might excel at spotting toxic language overall yet fail to perform equally well across different groups.

“If I look only at the overall results, I might conclude the model is effective, even if it fails to accurately identify toxic speech for a smaller subset of users,” she explains. For instance, it could be more adept at recognizing offensive language towards one ethnic group compared to another.

Recent research led by De-Arteaga and her colleagues indicates that it’s feasible to achieve both high accuracy and fairness in AI moderation. They have developed an algorithm designed to help stakeholders find the right balance between these elements based on their specific needs.
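
The study frames this choice as finding Pareto trade-offs between accuracy and fairness: among candidate models, keep only those that no other model beats on both criteria, and let stakeholders pick from that frontier. Below is a minimal illustrative sketch of that selection step, not the authors' implementation; the model names and scores are hypothetical.

```python
# Illustrative sketch: picking Pareto-optimal accuracy/fairness trade-offs.
# Candidate models and their scores are hypothetical, not results from the study.
candidates = {
    "model_a": {"accuracy": 0.91, "fairness": 0.72},
    "model_b": {"accuracy": 0.89, "fairness": 0.86},
    "model_c": {"accuracy": 0.87, "fairness": 0.84},  # dominated by model_b
    "model_d": {"accuracy": 0.84, "fairness": 0.93},
}

def pareto_frontier(models):
    """Keep models that no other model matches or beats on both criteria."""
    frontier = {}
    for name, s in models.items():
        dominated = any(
            o["accuracy"] >= s["accuracy"] and o["fairness"] >= s["fairness"]
            and (o["accuracy"] > s["accuracy"] or o["fairness"] > s["fairness"])
            for other, o in models.items() if other != name
        )
        if not dominated:
            frontier[name] = s
    return frontier

print(pareto_frontier(candidates))  # model_a, model_b, and model_d remain
# A stakeholder who weighs fairness heavily might pick model_d;
# one who prioritizes raw accuracy might pick model_a.
```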

The study’s results can be found in Information Research, an international electronic journal.

Together with professor Matthew Lease and graduate students Soumyajit Gupta and Anubrata Das of UT’s School of Information, and Venelin Kovatchev of the University of Birmingham, De-Arteaga used a dataset of 114,000 social media posts that other researchers had previously labeled “toxic” or “nontoxic.”

The team applied a fairness metric called Group Accuracy Parity (GAP), which rewards a model for achieving similar accuracy across groups, to train a machine-learning model that balances fairness with accuracy (a rough sketch of the idea follows the list below). On this dataset, the GAP-trained model:

  • Performed up to 1.5% better than the next-best method at treating all groups fairly.
  • Was the most effective at maximizing fairness and accuracy simultaneously.
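
The paper gives the precise GAP formulation; as a loose illustration of the underlying idea, the sketch below simply compares a classifier's accuracy within each group rather than only overall, and reports the worst-case gap between groups. The labels, predictions, and group tags are hypothetical.

```python
# Illustrative sketch of a group-accuracy-parity check (not the paper's exact GAP metric):
# measure accuracy separately per group instead of only in aggregate.
from collections import defaultdict

def per_group_accuracy(y_true, y_pred, groups):
    """Accuracy computed separately for each group label."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}

def accuracy_parity_gap(y_true, y_pred, groups):
    """Worst-case difference in per-group accuracy (0.0 = perfect parity)."""
    acc = per_group_accuracy(y_true, y_pred, groups)
    return max(acc.values()) - min(acc.values())

# Hypothetical data: 1 = toxic, 0 = nontoxic, each post tagged with group "A" or "B".
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

print(per_group_accuracy(y_true, y_pred, groups))   # {'A': 0.75, 'B': 0.75}
print(accuracy_parity_gap(y_true, y_pred, groups))  # 0.0 -> equal accuracy across groups
```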

However, De-Arteaga cautions that GAP isn’t a universal solution for fairness. Different stakeholders may prioritize various fairness metrics, and the data required to train these systems can vary widely based on the specific groups and contexts involved.

For instance, perceptions of what constitutes toxic speech can differ among groups, and these standards can change over time.

If an algorithm fails to capture these nuances, it could misclassify harmless speech as toxic and unjustly shut people out of online spaces, or, conversely, let hateful content through and expose users to more of it.

This challenge grows for platforms like Facebook and X, which cater to a global and diverse audience.

“How do you address fairness from the design of the data and algorithms without a solely U.S.-centric view?” De-Arteaga questions.

As a result, it may be necessary to continuously update algorithms, and developers must adapt them to the specific circumstances and types of content they are monitoring. To support this ongoing effort, the researchers have made the GAP code accessible to the public.

De-Arteaga affirms that achieving high levels of fairness and accuracy is possible when designers consider both technical and cultural aspects.

“You need a commitment and interdisciplinary knowledge,” she states. “It’s crucial to incorporate these considerations.”

More information: Soumyajit Gupta et al, Finding Pareto trade-offs in fair and accurate detection of toxic speech, Information Research, an international electronic journal (2025). DOI: 10.47989/ir30iConf47572
