Mohit Chandra and Yiqiao (Ahren) Jin discovered a language disparity in the performance of popular chatbots like ChatGPT.

Image source: Georgia Tech

News • Language barriers for health information

Chatbots get less accurate when health queries are not in English

Researchers at the Georgia Institute of Technology found that chatbots are less accurate in Spanish, Chinese, and Hindi compared to English when asked health-related questions.

The researchers say non-English speakers shouldn’t rely on chatbots like ChatGPT to provide valuable healthcare advice. A team of researchers from the College of Computing at Georgia Tech has developed a framework for assessing the capabilities of large language models (LLMs). Ph.D. students Mohit Chandra and Yiqiao (Ahren) Jin are the co-lead authors of the paper “Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries,” which is available as a preprint on arXiv.

Their findings reveal a gap in how well LLMs answer health-related questions across languages. Chandra and Jin point out the limitations of LLMs for users and developers, but they also highlight the models’ potential.

Their XLingEval framework cautions non-English speakers against using chatbots as alternatives to doctors for advice. However, models can improve by deepening the data pool with multilingual source material such as their proposed XLingHealth benchmark. “For users, our research supports what ChatGPT’s website already states: chatbots make a lot of mistakes, so we should not rely on them for critical decision-making or for information that requires high accuracy,” Jin said. “Since we observed this language disparity in their performance, LLM developers should focus on improving accuracy, correctness, consistency, and reliability in other languages.”

Using XLingEval, the researchers found chatbots are less accurate in Spanish, Chinese, and Hindi compared to English. By focusing on correctness, consistency, and verifiability, they discovered: 

  • Correctness decreased by 18% when the same questions were asked in Spanish, Chinese, and Hindi.
  • Answers in non-English languages were 29% less consistent than their English counterparts.
  • Non-English responses were 13% less verifiable overall.

XLingHealth contains question-answer pairs that chatbots can reference, which the group hopes will spark improvement within LLMs. The HealthQA dataset uses specialized healthcare articles from the popular healthcare website Patient. It includes 1,134 health-related question-answer pairs excerpted from the original articles. LiveQA is a second dataset containing 246 question-answer pairs constructed from frequently asked questions (FAQ) platforms associated with the U.S. National Institutes of Health (NIH). For drug-related questions, the group built a MedicationQA component. This dataset contains 690 questions extracted from anonymous consumer queries submitted to MedlinePlus. The answers are sourced from medical references, such as MedlinePlus and DailyMed.
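
As a quick sanity check on these figures, the short Python sketch below tallies the three component sizes quoted above. The names XLINGHEALTH_SIZES, total_pairs, and shares are illustrative only and are not taken from the researchers' code. The total would be consistent with the "over 2,000" questions used in the tests if the tests drew on the full benchmark, which is an assumption here.

```python
# Dataset sizes as quoted in the article. The variable and function
# names are hypothetical, chosen for this illustration only.
XLINGHEALTH_SIZES = {
    "HealthQA": 1134,
    "LiveQA": 246,
    "MedicationQA": 690,
}


def total_pairs(sizes):
    """Return the total number of question-answer pairs."""
    return sum(sizes.values())


def shares(sizes):
    """Return each dataset's share of the total, in percent (1 decimal)."""
    total = total_pairs(sizes)
    return {name: round(100 * count / total, 1) for name, count in sizes.items()}


if __name__ == "__main__":
    print(total_pairs(XLINGHEALTH_SIZES))  # 2070
    print(shares(XLINGHEALTH_SIZES))
```

HealthQA makes up a little more than half of the pairs, so results on the combined benchmark would be weighted toward that dataset unless each component is reported separately.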

In their tests, the researchers put more than 2,000 medical questions to ChatGPT-3.5 and MedAlpaca, a healthcare question-answering chatbot trained on medical literature. Despite that specialized training, more than 67% of MedAlpaca’s responses to non-English questions were irrelevant or contradictory. “We see far worse performance in the case of MedAlpaca than ChatGPT,” Chandra said. “The majority of the data for MedAlpaca is in English, so it struggled to answer queries in non-English languages. GPT also struggled, but it performed much better than MedAlpaca because it had some sort of training data in other languages.”

The group tested Spanish, Chinese, and Hindi because they are the world’s most spoken languages after English. Personal curiosity and background played a part in inspiring the study. “ChatGPT was very popular when it launched in 2022, especially for us computer science students who are always exploring new technology,” said Jin. “Non-native English speakers, like Mohit and I, noticed early on that chatbots underperformed in our native languages.” 

Source: Georgia Institute of Technology

