Harvard and Stanford University research reveals serious flaws in medical AI systems

January 3, 2025 (Updated: February 13, 2025)
[Image: robot or AI doctor (© BiancoBlue | Dreamstime.com)]

In a nutshell

When AI models had to reach a diagnosis through conversation rather than a multiple-choice test, accuracy dropped dramatically, in some cases from 82% to 26%. The models struggled with basic clinical skills, such as asking follow-up questions and integrating multiple pieces of information, suggesting that AI tools are not yet ready for independent patient interaction and should complement, rather than replace, human physicians.

BOSTON — Artificial intelligence is showing significant promise in healthcare, from reading X-rays to suggesting treatment plans. But a new study from Harvard Medical School and Stanford University finds that AI still has major limitations when it comes to actually speaking with patients and making accurate diagnoses through conversation, a cornerstone of medical practice.

The study, published in Nature Medicine, introduces an innovative testing framework called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine) that uses large language models (LLMs) to assess how well other LLMs perform in realistic clinical conversations. As patients increasingly rely on AI tools like ChatGPT to interpret symptoms and medical test results, understanding how these systems function in the real world is important.

“Our study reveals a striking contradiction: these AI models excel at medical board exams but struggle with the basic back-and-forth of a doctor’s visit,” explained study senior author Pranav Rajpurkar, assistant professor of biomedical informatics at Harvard Medical School. “The dynamic nature of medical conversations—the need to ask the right questions at the right time, piece together scattered information, and reason about symptoms—poses unique challenges that go far beyond answering multiple-choice questions.”

A research team led by senior authors Rajpurkar and Roxana Daneshjou of Stanford University evaluated four prominent AI models on 2,000 medical cases spanning 12 specialties. Current assessment methods typically rely on multiple-choice medical exam questions that present information in a structured format. But “in the real world, this process is trickier,” said Shreya Johri, co-lead author of the study.

Testing with CRAFT-MD revealed significant performance differences between traditional assessments and more realistic scenarios. On multiple-choice questions (MCQs) with four options, GPT-4’s diagnostic accuracy fell from 82% when reading a prepared case summary to 63% when it had to gather the same information through dialogue. The decline was even more pronounced in open-ended scenarios without multiple-choice options, with accuracy dropping to 49% for written summaries and 26% for simulated patient interviews.
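These four conditions form a two-by-two grid: question format (multiple-choice vs. open-ended) crossed with how the case information is presented (written summary vs. dialogue). A minimal Python snippet restating the reported GPT-4 figures makes the pattern easy to see; the numbers are those quoted above, not a re-computation:

```python
# GPT-4 diagnostic accuracy across the four evaluation conditions,
# as reported in the article (restated here, not re-derived).
accuracy = {
    ("multiple-choice", "written summary"):    0.82,
    ("multiple-choice", "simulated dialogue"): 0.63,
    ("open-ended",      "written summary"):    0.49,
    ("open-ended",      "simulated dialogue"): 0.26,
}

for (question_format, presentation), acc in accuracy.items():
    print(f"{question_format:15} + {presentation:18} -> {acc:.0%}")

# Removing answer options and requiring dialogue each hurt accuracy;
# together they cost 56 percentage points (82% -> 26%).
drop = accuracy[("multiple-choice", "written summary")] \
     - accuracy[("open-ended", "simulated dialogue")]
print(f"combined drop: {drop:.0%}")
```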

The AI models had particular difficulty synthesizing information across multiple conversational exchanges. Common problems included missing important details when taking a patient’s medical history, failing to ask appropriate follow-up questions, and struggling to combine visual data from medical images with patient-reported symptoms.

Efficiency is another advantage of the framework: CRAFT-MD can process 10,000 conversations in 48 to 72 hours, plus 15 to 16 hours of expert evaluation. A comparable human-based assessment would require large-scale recruitment, with approximately 500 hours of patient simulation and 650 hours of expert assessment.
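A quick back-of-the-envelope comparison, taking the quoted ranges at their midpoints and assuming the human-effort figures refer to a comparable workload, illustrates the gap:

```python
# Rough effort comparison for evaluating 10,000 conversations, using the
# figures quoted above (midpoints of the stated ranges; treating the human
# figures as covering the same workload is our assumption).
craft_md_hours = (48 + 72) / 2 + (15 + 16) / 2   # processing + expert review
human_hours    = 500 + 650                       # patient simulation + expert assessment

print(f"CRAFT-MD:    ~{craft_md_hours:.0f} hours")           # ~76 hours
print(f"human-based: ~{human_hours} hours")                  # 1150 hours
print(f"speed-up:    ~{human_hours / craft_md_hours:.0f}x")  # ~15x
```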

“As a physician-scientist, I am interested in AI models that can effectively and ethically enhance clinical practice,” says Daneshjou, assistant professor of biomedical data science and dermatology at Stanford University. “CRAFT-MD creates a framework that more closely mirrors real-world interactions, and so will help advance the field in testing how AI models perform in healthcare.”

Based on these findings, the researchers offer comprehensive recommendations for AI development and regulation. These include building models that can handle unstructured conversations, better integrate different data types (text, images, clinical measurements), and interpret nonverbal communication cues. They also stress the importance of pairing AI-based evaluation with human expert assessment, ensuring thorough testing while avoiding the premature exposure of real patients to unvalidated systems.

The study shows that while AI holds promise in medicine, significant progress is needed before current systems can engage with the complex, dynamic nature of real-world doctor-patient interactions. For now, these tools may work best as a complement to, rather than a replacement for, human medical expertise.

Paper summary

Methodology

The researchers created a testing system in which one AI agent plays the patient (providing information based on real medical cases) and another plays the doctor (asking questions and making diagnoses). Medical experts reviewed these interactions to ensure quality and accuracy. The study covered 2,000 cases across a variety of medical specialties and tested multiple formats, including traditional written case summaries, back-and-forth conversations, single-question diagnoses, and summarized conversations, both with and without multiple-choice answer options.
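As a rough illustration of this setup, here is a minimal Python skeleton of an agent-versus-agent evaluation loop. The `chat()` function is a hypothetical stand-in for whatever chat-model call one has available, and the prompts, turn limit, and stopping rule are our assumptions, not the published CRAFT-MD protocol:

```python
# Minimal sketch of a patient-agent / doctor-agent evaluation loop.
# `chat()` is a hypothetical placeholder: wire it to any chat-model API.
# The prompts and stopping rule below are illustrative assumptions only.

PATIENT_PROMPT = (
    "You are a patient. Answer the doctor's questions using only the facts "
    "in this case vignette, and volunteer nothing beyond what is asked:\n{vignette}"
)
DOCTOR_PROMPT = (
    "You are a physician interviewing a patient, one question per turn. "
    "When confident, answer exactly 'DIAGNOSIS: <condition>'."
)

def chat(system_prompt: str, transcript: list[tuple[str, str]]) -> str:
    """Placeholder for a chat-model call; replace with a real API."""
    raise NotImplementedError

def run_case(vignette: str, max_turns: int = 10) -> str | None:
    """Interview loop: the doctor-agent questions a vignette-grounded patient-agent."""
    transcript: list[tuple[str, str]] = []  # (speaker, utterance) pairs
    for _ in range(max_turns):
        doctor_turn = chat(DOCTOR_PROMPT, transcript)
        transcript.append(("doctor", doctor_turn))
        if doctor_turn.startswith("DIAGNOSIS:"):
            return doctor_turn.removeprefix("DIAGNOSIS:").strip()
        patient_turn = chat(PATIENT_PROMPT.format(vignette=vignette), transcript)
        transcript.append(("patient", patient_turn))
    return None  # the doctor-agent never committed to a diagnosis

# In the study, a further step had medical experts review transcripts and
# judge the diagnoses; that grading stage is omitted from this sketch.
```

Keeping each agent behind its own system prompt is what grounds the patient-agent in the vignette without letting it simply hand the whole case over, which is the point of testing diagnosis through dialogue.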

Results

A key finding was that AI performance degraded significantly when moving from written summaries to conversational diagnosis. With multiple-choice options, GPT-4’s accuracy dropped from 82% to 63%. Without them, the decline was even more dramatic, falling to 26% for conversational diagnosis. The AI also struggled to integrate information across multiple exchanges and to judge when to stop collecting information.

Limitations

This study primarily used simulated patient interactions rather than real patients, so it may not fully capture the complexity of real-world clinical situations. The study also focused primarily on diagnostic accuracy, rather than other important aspects of medical care such as bedside manner and emotional support. Additionally, the study used AI to simulate patient responses, which may not fully reflect how patients communicate in real life.

Discussion and key points

This study suggests that current AI models, while good at certain structured tasks, are not yet ready for independent patient interaction. The findings indicate that AI may be used more effectively as an adjunct tool rather than a replacement for human doctors. The study also highlights the importance of developing AI systems that can better handle dynamic conversations and information synthesis.

Funding and disclosure

This research was supported by an HMS Dean’s Innovation Award and an Accelerate Foundation Models Research grant from Microsoft awarded to Pranav Rajpurkar. Additional funding was provided through an IIE Quad Fellowship. Several researchers disclosed industry connections, including Daneshjou’s consulting positions with DWA, Pfizer, L’Oréal, and VisualDx, as well as stock options with medical technology companies. Other disclosures include pending patents and various advisory and equity positions held by team members at healthcare companies.

Publication information

The study, by researchers at Harvard Medical School, Stanford University, and other leading medical institutions, was published in Nature Medicine (DOI: 10.1038/s41591-024-03328-5).
