A new framework drawing on insights from the humanities and the social sciences could help prevent artificial intelligence (AI) tools from spreading misinformation and discriminatory content, according to an analysis by researchers in the UK.
To address the shortcomings of large language model (LLM) systems such as ChatGPT, the researchers focused on the data on which these models are trained, integrating principles from sociolinguistics, the study of language variation and change.
The study, published in Frontiers in Artificial Intelligence, emphasises the importance of representing the diverse dialects, registers and time periods in which language is produced.
“When prompted, generative AIs such as ChatGPT may be more likely to produce negative portrayals about certain ethnicities and genders,” says lead author Jack Grieve from the University of Birmingham, UK, “but our research offers solutions for how LLMs can be trained in a more principled manner to mitigate social biases.”
“These types of issues can generally be traced back to the data that the LLM was trained on. If the training data contains relatively frequent expressions of harmful or inaccurate ideas about certain social groups, LLMs will inevitably reproduce those biases, resulting in potentially racist or sexist content.”
The team suggests that fine-tuning LLMs on datasets that capture the diversity of language can improve the output of AI systems.
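The paper does not prescribe a particular implementation, but the underlying idea of balancing training data across language varieties can be sketched in a few lines of Python. The snippet below is a hypothetical illustration only: it assumes each document carries "dialect" and "register" metadata and samples every group equally before fine-tuning; the field names, group sizes and toy corpus are assumptions, not details from the study.

```python
import random
from collections import defaultdict

def build_balanced_corpus(documents, per_group=1000, seed=0):
    """Sample an equal number of documents from each (dialect, register) group.

    `documents` is assumed to be a list of dicts with "text", "dialect" and
    "register" keys; a real corpus would need its own metadata scheme.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for doc in documents:
        groups[(doc["dialect"], doc["register"])].append(doc)

    balanced = []
    for docs in groups.values():
        rng.shuffle(docs)
        balanced.extend(docs[:per_group])  # cap every group at the same size
    rng.shuffle(balanced)
    return balanced

# Toy example: a corpus heavily skewed towards one variety becomes evenly
# represented in the fine-tuning set.
corpus = (
    [{"text": f"news item {i}", "dialect": "US English", "register": "news"}
     for i in range(500)]
    + [{"text": f"forum post {i}", "dialect": "Nigerian English", "register": "social media"}
       for i in range(50)]
)
fine_tuning_set = build_balanced_corpus(corpus, per_group=50)
```

In this simplified sketch, over-represented varieties are capped rather than removed, so the resulting fine-tuning set reflects a wider range of dialects and registers without discarding any variety entirely.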
“We propose that increasing the sociolinguistic diversity of training data is far more important than merely expanding its scale,” Grieve adds. “For all these reasons, we therefore believe there is a clear and urgent need for sociolinguistic insight in LLM design and evaluation.
“Understanding the structure of society, and how this structure is reflected in patterns of language use, is critical to maximising the benefits of LLMs for the societies in which they are increasingly being embedded. More generally, incorporating insights from the humanities and the social sciences is crucial for developing AI systems that better serve humanity.”