Researchers from the Georgia Institute of Technology have found that chatbots are less accurate when asked health-related questions in Hindi, Chinese, and Spanish.
The researchers say that non-English speakers should not depend on chatbots like ChatGPT for reliable healthcare advice. A team at Georgia Tech's College of Computing has developed a framework for evaluating the capabilities of large language models (LLMs). Mohit Chandra and Yiqiao Jin are the co-lead authors of Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries.
Their findings reveal a gap in LLMs' ability to answer health questions across languages. Chandra and Jin pointed out the drawbacks of LLMs for users and developers, while also highlighting their potential.
The XLingEval framework cautions non-English speakers against using chatbots as a substitute for doctors' advice.
At the same time, the models can be improved by deepening the data pool with multilingual source material, such as the group's proposed XLingHealth benchmark.
According to Jin, for users the research reinforces what the ChatGPT website already states: chatbots make many mistakes, so they should not be relied on for major decisions or for information that requires high accuracy. And because a language disparity has been observed in chatbot performance, LLM developers should focus on improving correctness, consistency, and verifiability in languages other than English.
Using XLingEval, the researchers found that chatbots are far less accurate in Hindi, Chinese, and Spanish than in English. Focusing on correctness, consistency, and verifiability, they found that:
- Correctness dropped by 18% when the same questions were asked in Hindi, Chinese, and Spanish.
- Non-English answers were 29% less consistent than their English counterparts.
- Non-English responses were 13% less verifiable.
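The kind of cross-lingual comparison described above can be illustrated with a minimal sketch: pose the same question in several languages and score each non-English answer against the English one. This is not the published XLingEval code; the toy word-overlap metric and the canned answers below are purely illustrative assumptions.

```python
# Illustrative sketch of a cross-lingual consistency check, in the spirit of
# XLingEval. The similarity metric and sample answers are toy placeholders,
# not the framework's actual method or data.

def jaccard_similarity(a: str, b: str) -> float:
    """Toy word-overlap similarity between two answers (0.0 to 1.0)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def consistency_scores(answers: dict[str, str]) -> dict[str, float]:
    """Score each non-English answer against the English baseline."""
    baseline = answers["English"]
    return {
        lang: jaccard_similarity(baseline, text)
        for lang, text in answers.items()
        if lang != "English"
    }

# Canned answers standing in for real model responses (shown in English
# translation for readability).
answers = {
    "English": "drink fluids and rest for a few days",
    "Spanish": "drink fluids and rest",
    "Hindi": "take antibiotics immediately",
}
scores = consistency_scores(answers)
# A low score flags an answer that diverges from the English response.
```

In a real harness, the canned strings would be replaced by live model outputs and the toy metric by a semantic-similarity model.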
XLingHealth contains question-answer pairs that chatbots can use as a reference, which the group says will improve LLMs. The HealthQA dataset draws on specialized healthcare articles from Patient, a popular healthcare website, and includes 1,134 health-related question-answer pairs excerpted from those articles. LiveQA, the second dataset, contains 246 question-answer pairs built from frequently asked questions. The group also built a MedicationQA component of 690 questions drawn from consumer queries submitted to MedlinePlus and DailyMed.
In their tests, the researchers asked more than 2,000 medical questions to ChatGPT-3.5 and MedAlpaca, a healthcare question-answering chatbot trained on medical literature.
Still, more than 60% of the responses to non-English questions were irrelevant or contradictory. Chandra said MedAlpaca performed far worse than ChatGPT: most of MedAlpaca's training data is in English, so it naturally struggles to answer non-English queries. GPT also struggled, yet it performed much better than MedAlpaca because some of its training data covers other languages.
The group tested Hindi, Chinese, and Spanish because they are the world's most widely spoken languages after English. Personal curiosity and background also played a critical role in inspiring the study. Jin said ChatGPT was a prominent and popular tool when it launched in 2022, especially for computer science scholars eager to explore new technology, and non-native English speakers like him and Chandra noticed that chatbots underperformed in their native languages.