Generative AI models are making their way into healthcare settings, sometimes before those settings are ready for them. Early adopters are confident the technology will deliver improved productivity and a deeper understanding of their data, while critics point to the models’ flaws and biases, which could contribute to worse health outcomes.
Is there a way to determine a model’s effectiveness or potential drawbacks when it comes to tasks like summarizing patient records or answering health-related questions?
Hugging Face, the AI startup, has proposed a solution in a newly released benchmark called Open Medical-LLM. Created in partnership with researchers at the nonprofit Open Life Science AI and the University of Edinburgh’s Natural Language Processing Group, Open Medical-LLM aims to standardize how the performance of generative AI models is evaluated across a range of medical tasks.
Open Medical-LLM is not a benchmark built from scratch; rather, it is a compilation of existing test sets, including MedQA, PubMedQA, and MedMCQA, that probe models for general medical knowledge and related fields such as anatomy, pharmacology, genetics, and clinical practice. The benchmark contains a variety of questions requiring medical reasoning and understanding, drawn from U.S. and Indian medical licensing exams as well as college biology question banks.
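For anyone curious about what the models are actually being graded on, the constituent test sets are publicly available on the Hugging Face Hub. Below is a minimal sketch of inspecting PubMedQA with the `datasets` library; the repository ID, config name, and field names are assumptions drawn from the public dataset card rather than details specified in the Open Medical-LLM announcement, so verify them before relying on the snippet.

```python
# Minimal sketch: peek at one of Open Medical-LLM's constituent test sets.
# Assumes PubMedQA's expert-labeled subset is published on the Hugging Face Hub
# as "qiaojin/PubMedQA" with the "pqa_labeled" config -- check the dataset card
# if these identifiers have changed.
from datasets import load_dataset

pubmedqa = load_dataset("qiaojin/PubMedQA", "pqa_labeled", split="train")

sample = pubmedqa[0]
print(sample["question"])        # a biomedical research question
print(sample["final_decision"])  # gold answer: "yes", "no", or "maybe"
```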
“[Open Medical-LLM] empowers researchers and practitioners to assess the merits and drawbacks of various approaches, propel further progress in the field, and ultimately enhance patient care and outcomes,” stated Hugging Face in a blog post.
Hugging Face pitches the benchmark as a robust way to evaluate healthcare-focused generative AI models. But some medical experts on social media cautioned against putting too much stock in Open Medical-LLM, warning that it could lead to ill-informed deployments.
On X, Liam McCoy, a resident physician in neurology at the University of Alberta, highlighted the significant disparity between the artificial setting of medical question-answering and real-life clinical practice.
Clémentine Fourrier, a Hugging Face research scientist who also contributed to the blog post, concurred.
“These leaderboards can serve as an initial guide for determining which generative AI model to explore for a specific use case. However, it is important to conduct thorough testing to truly understand the model’s limitations and relevance in real-world conditions,” Fourrier responded on X. Medical models, she added, should never be used on their own by patients, but should instead be shaped into tools that support medical professionals.
It reminds me of Google’s attempt to introduce an AI screening tool for diabetic retinopathy to healthcare systems in Thailand.
Google developed a deep learning system that scanned eye images for signs of retinopathy, a major cause of vision loss. Despite high theoretical accuracy, the tool proved impractical in real-world testing, frustrating both patients and nurses with inconsistent results and a general lack of alignment with on-the-ground practices.
It’s telling that of the 139 AI-related medical devices the U.S. Food and Drug Administration has approved so far, none use generative AI. Testing how a generative AI tool performs in a controlled environment is hard enough; the harder question is how that performance will translate to hospitals and outpatient clinics, and, just as important, how its outcomes will trend over time.
None of this is to say that Open Medical-LLM isn’t useful or informative. The results leaderboard, if nothing else, is a stark reminder of just how poorly models answer basic health questions. But neither Open Medical-LLM nor any other benchmark is a substitute for thorough real-world testing.