
In a nutshell
When AI doctors had to diagnose through conversations rather than multiple-choice tests, accuracy dropped dramatically, in some cases from 82% to 26%. The models struggled with basic clinical skills, such as asking follow-up questions and integrating multiple pieces of information, suggesting that AI tools are not yet ready for independent patient interaction and should complement, rather than replace, human physicians.
BOSTON — Artificial intelligence is showing significant promise in healthcare, from reading X-rays to suggesting treatment plans. But a new study from Harvard Medical School and Stanford University finds that AI still has major limitations when it comes to actually speaking with patients and making accurate diagnoses through conversation, a cornerstone of medical practice.
The study, published in Nature Medicine, introduces an innovative testing framework called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine), which uses large language models (LLMs) to assess how well other LLMs perform in simulated clinical conversations. As patients increasingly rely on AI tools like ChatGPT to interpret symptoms and medical test results, understanding how these systems function in real-world conditions has become essential.
“Our study reveals a striking contradiction: these AI models excel at medical board exams but struggle with the basic back-and-forth of a doctor’s visit,” said study senior author Pranav Rajpurkar, assistant professor of biomedical informatics at Harvard Medical School. “The dynamic nature of medical conversations, the need to ask the right questions at the right time, piece together scattered information, and reason about symptoms, poses unique challenges that go far beyond answering multiple-choice questions.”
A research team led by senior authors Rajpurkar and Roxana Daneshjou of Stanford University evaluated four prominent AI models on 2,000 medical cases spanning 12 specialties. Current assessment methods typically rely on multiple-choice medical examination questions that present information in a structured format. But “in the real world, this process is trickier,” said Shreya Johri, co-lead author of the study.
Testing conducted through CRAFT-MD revealed significant gaps between performance on traditional assessments and on more realistic scenarios. For multiple-choice questions (MCQs) with four options, GPT-4’s diagnostic accuracy fell from 82% when reading a prepared case summary to 63% when it had to gather the same information through dialogue. The decline was even more pronounced in open-ended scenarios without answer options: accuracy dropped to 49% for written summaries and 26% for simulated patient interviews.
The AI models had particular difficulty synthesizing information across multiple conversational exchanges. Common failures included missing important details while taking a patient’s medical history, failing to ask appropriate follow-up questions, and struggling to combine visual data from medical images with patient-reported symptoms.
Efficiency is another advantage of the framework: CRAFT-MD can process 10,000 conversations in 48 to 72 hours, plus 15 to 16 hours of expert evaluation. By contrast, human-based assessment at the same scale would require large-scale recruitment, roughly 500 hours of patient simulations, and about 650 hours of expert review.
“As a physician-scientist, I am interested in AI models that can effectively and ethically enhance clinical practice,” says Daneshjou, assistant professor of biomedical data science and dermatology at Stanford University. “CRAFT-MD creates a framework that more closely mirrors real-world interactions, and thus helps advance the field when it comes to testing the performance of AI models in health care.”
Based on these findings, the researchers offer comprehensive recommendations for AI development and regulation. These include building models that can handle unstructured conversations, better integrating different data types (text, images, clinical measurements), and equipping systems to interpret nonverbal communication cues. They also stress the importance of pairing AI-based evaluation with human expert assessment to ensure thorough testing while avoiding the premature exposure of real patients to unvalidated systems.
The study shows that while AI holds promise in medicine, substantial progress is needed before current systems can engage with the complex, dynamic nature of real-world doctor-patient interactions. For now, these tools may work best as a complement to, rather than a replacement for, human medical expertise.
Paper summary
Methodology
The researchers created an advanced testing system in which one AI plays the role of a patient (providing information based on real-world medical cases) and another AI plays the role of a doctor (asking questions and making diagnoses). Medical experts reviewed these interactions to ensure quality and accuracy. The study included 2,000 cases across a variety of medical specialties and tested multiple formats, including traditional written case summaries, back-and-forth conversations, single-question diagnoses, and summarized conversations. Scenarios were evaluated both with and without multiple-choice diagnosis options.
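The agent-versus-agent setup described above can be sketched in a few lines of Python. This is an illustrative toy, not the CRAFT-MD implementation: the rule-based `patient_agent` and `doctor_agent` below are hypothetical stand-ins for the LLM calls the study actually uses, and the case data is invented for demonstration.

```python
# Minimal sketch of a conversational diagnostic evaluation loop,
# in the spirit of (but not identical to) the CRAFT-MD framework.
from dataclasses import dataclass, field

@dataclass
class Case:
    vignette: dict            # topic -> detail the "patient" can reveal if asked
    diagnosis: str            # ground-truth diagnosis for scoring

def patient_agent(case: Case, question: str) -> str:
    """Stand-in for a patient-role LLM: reveals only what is asked about."""
    for topic, answer in case.vignette.items():
        if topic in question.lower():
            return answer
    return "I'm not sure."

def doctor_agent(transcript: list[str]) -> str:
    """Stand-in for a doctor-role LLM: a trivial keyword-based diagnoser."""
    text = " ".join(transcript).lower()
    if "chest pain" in text and "shortness of breath" in text:
        return "myocardial infarction"
    return "unknown"

def run_interview(case: Case, questions: list[str]) -> str:
    """Multi-turn interview: doctor asks, patient answers, doctor diagnoses."""
    transcript = [patient_agent(case, q) for q in questions]
    return doctor_agent(transcript)

def accuracy(cases: list[Case], questions: list[str]) -> float:
    """Fraction of cases where the conversational diagnosis matches ground truth."""
    correct = sum(run_interview(c, questions) == c.diagnosis for c in cases)
    return correct / len(cases)
```

In the real framework the two roles are played by large language models, the doctor agent chooses its own follow-up questions, and medical experts audit the resulting transcripts; the loop structure, however, is essentially the one above.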
Results
The key finding was that AI performance degraded significantly when moving from written summaries to conversational diagnosis. With multiple-choice options, GPT-4’s accuracy dropped from 82% to 63%; without them, it fell even more sharply, to 26% for conversational diagnosis. The AI also struggled to integrate information across multiple exchanges and to judge when to stop gathering information.
Limitations
This study primarily used simulated patient interactions rather than real patients, so it may not fully capture the complexity of real-world clinical situations. The study also focused primarily on diagnostic accuracy, rather than other important aspects of medical care such as bedside manner and emotional support. Additionally, the study used AI to simulate patient responses, which may not fully reflect how patients communicate in real life.
Discussion and key points
This study suggests that current AI models, while good at certain structured tasks, are not yet ready for independent patient interaction. The findings indicate that AI may be used more effectively as an adjunct tool rather than a replacement for human doctors. The study also highlights the importance of developing AI systems that can better handle dynamic conversations and information synthesis.
Funding and disclosure
This research was supported by an HMS Dean’s Innovation Award and an Accelerate Foundation Models Research grant from Microsoft awarded to Pranav Rajpurkar. Additional funding was provided through an IIE Quad Fellowship. Several researchers disclosed industry connections, including Daneshjou’s consulting positions with DWA, Pfizer, L’Oréal, and VisualDx, as well as stock options with medical technology companies. Other disclosures include pending patents and various advisory and equity positions held by team members at health care companies.
Publication information
This study was published in Nature Medicine (DOI: 10.1038/s41591-024-03328-5) by researchers at Harvard Medical School, Stanford University, and other leading medical institutions.