It’s tempting to turn to search engines for health information, but with the rise of large language models such as ChatGPT, people are increasingly likely to depend on AI for answers too.
Concerningly, an Australian study has now found that supplying ChatGPT with evidence alongside a health-related question makes its answers less reliable.
The use of large language models (LLMs) and artificial intelligence in health care is still developing, leaving a critical evidence gap in a setting where incorrect answers can have serious consequences for people’s health.
To address this, scientists from Australia’s national science agency, CSIRO, and the University of Queensland (UQ) explored a hypothetical scenario: an average person asking ChatGPT whether treatment ‘X’ has a positive effect on condition ‘Y’.
They presented ChatGPT with 100 questions sourced from the TREC Health Misinformation track – ranging from ‘Can zinc help treat the common cold?’ to ‘Will drinking vinegar dissolve a stuck fish bone?’
Because queries to search engines are typically short, while prompts to an LLM can be far longer, they posed the questions in two different formats: the first as a simple question, and the second as a question biased with supporting or contrary evidence.
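As a rough illustration only (this is not the study’s code, and the prompt wording and evidence snippet here are invented), the two formats might be constructed something like this in Python:

```python
# Illustrative sketch only, not the study's actual prompts or code.
# The question is one of the TREC Health Misinformation examples mentioned
# above; the evidence passage is a made-up snippet for illustration.

def question_only_prompt(question: str) -> str:
    """Format 1: the health question on its own."""
    return f"{question} Answer yes, no, or unsure."

def evidence_biased_prompt(question: str, evidence: str) -> str:
    """Format 2: the same question, prefaced with supporting or
    contrary evidence that may bias the model's answer."""
    return (
        f"Consider this evidence: {evidence}\n\n"
        f"{question} Answer yes, no, or unsure."
    )

if __name__ == "__main__":
    question = "Can zinc help treat the common cold?"
    # Hypothetical evidence snippet, invented purely for this example.
    evidence = "A small trial reported shorter colds in adults taking zinc lozenges."

    print(question_only_prompt(question))
    print(evidence_biased_prompt(question, evidence))
```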
By comparing ChatGPT’s responses with the known correct answers based on existing medical knowledge, they found that ChatGPT answered accurately 80% of the time in the question-only format. However, when given an evidence-biased prompt, accuracy dropped to 63%, and fell further to 28% when an “unsure” answer was allowed.
“We’re not sure why this happens,” says CSIRO Principal Research Scientist and Associate Professor at UQ, Dr Bevan Koopman, who is co-author of the paper.
“But given this occurs whether the evidence given is correct or not, perhaps the evidence adds too much noise, thus lowering accuracy.”
Study co-author Guido Zuccon, Director of AI for the Queensland Digital Health Centre at UQ, says that major search engines are now integrating LLMs and search technologies in a process called Retrieval Augmented Generation.
“We demonstrate that the interaction between the LLM and the search component is still poorly understood, resulting in the generation of inaccurate health information,” says Zuccon.
Given how widely people now turn to LLMs online for answers about their health, Koopman adds, continued research is needed to inform the public of the risks and to help them optimise the accuracy of the answers they receive.
“While LLMs have the potential to greatly improve the way people access information, we need more research to understand where they are effective and where they are not.”