The study, presented Friday at the annual meeting of the American College of Physicians (ACP), found that diagnostic and treatment recommendations generated by AI were more accurate than those of human physicians in an emergency care setting.
The study was conducted at the virtual emergency care center at Cedars-Sinai Medical Center in Los Angeles, which is run in collaboration with Israeli startup K Health.
Cedars-Sinai's Virtual Care Center offers video consultations with family and emergency medicine doctors. Recently, it integrated an AI system that conducts the initial patient interview via structured chat, uses machine learning to incorporate the patient's history, and suggests a detailed diagnosis and treatment plan, including prescriptions, tests and referrals.
Here's how it works: after chatting with the AI, the patient proceeds to a video visit with a doctor, who makes the final call. Trained on millions of anonymized medical records, the AI offers recommendations only when it has high confidence; in roughly 20% of cases it withholds a recommendation due to uncertainty.
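The confidence gate described above can be pictured with a minimal sketch. The threshold value, data structure and function name below are illustrative assumptions for this article, not details of the K Health or Cedars-Sinai system.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative threshold only; the article does not state the actual cutoff used.
CONFIDENCE_THRESHOLD = 0.90

@dataclass
class Recommendation:
    diagnosis: str
    treatment: str
    confidence: float

def triage_recommendation(candidates: list[Recommendation]) -> Optional[Recommendation]:
    """Return the top-ranked suggestion only if the model is confident enough.

    Mirrors the behavior described in the article: in roughly 20% of visits
    the system withholds a recommendation rather than guess.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda r: r.confidence)
    if best.confidence < CONFIDENCE_THRESHOLD:
        return None  # withhold: the physician proceeds without an AI suggestion
    return best
```

In this sketch a withheld recommendation simply returns nothing, leaving the decision entirely to the physician at the video visit.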
In a previous study published last year, Zeltzer explained, the team compared the AI's diagnostic suggestions with doctors' diagnoses and found substantial agreement across common symptoms, particularly those related to respiratory and urinary problems. The new study went a step further by comparing the quality of the recommendations using a panel of experienced physicians.
The researchers analyzed 461 online patient visits from July 2024, involving adults with relatively common complaints (respiratory, urinary, ocular, gynecological and dental symptoms). In each case, the AI provided diagnostic and treatment suggestions before the patient was seen by the physician.
Professor Dan Zeltzer (Photo: Richard Haldis)
A panel of physicians, each with at least 10 years of clinical experience, evaluated every recommendation (AI and physician) on a four-point scale. The assessment took into account each patient's complete medical record, the consultation transcript and the clinical data.
• AI recommendations were rated optimal in 77% of cases, compared with 67% for physicians.
• Potentially harmful recommendations were less frequent with the AI (2.8%) than with physicians (4.6%).
• In 68% of visits, the AI and the physician received the same rating.
• In 21% of cases, the AI's recommendation was rated higher than the physician's; in 11%, the opposite was true.
"These findings surprised us," Zeltzer said. "Across a wide range of symptoms, our panel of experts rated the AI's recommendations as optimal more often, and as harmful less often, than those made by the doctors."
One notable example was antibiotic prescriptions. "Doctors sometimes prescribe antibiotics unnecessarily, for example for viral infections where they are ineffective," Zeltzer said. "Patients may pressure their doctors to prescribe antibiotics, but the AI is not swayed. It does not recommend treatments that go against clinical guidelines."
The AI also proved better at quickly cross-referencing a patient's medical history. "Practitioners working under pressure don't always review the entire patient record," he noted. "The AI can do it instantly."
Take urinary tract infections: treatment depends on whether it is a first occurrence or a recurrence, and on whether previous antibiotics have failed. "Some doctors didn't take that into account and didn't prescribe the most appropriate treatment," Zeltzer said. "The AI picked it up and adjusted accordingly."
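As a rough illustration of the history-dependent branching Zeltzer describes, here is a hypothetical sketch. The categories and suggested plans are placeholders for illustration only, not clinical guidance and not the system's actual rules.

```python
def uti_treatment_plan(first_episode: bool, prior_antibiotic_failed: bool) -> str:
    """Toy decision rule: the suggested plan changes with the patient's history.

    Placeholder logic only; real clinical guidelines are far more detailed.
    """
    if first_episode:
        return "first-line antibiotic, standard course"
    if prior_antibiotic_failed:
        return "switch antibiotic class and consider a urine culture"
    return "recurrent infection: review history and consider targeted therapy"
```

The point of the example is simply that the same complaint maps to different recommendations once the record is consulted, which is the step the panel found some physicians skipped.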
When asked about the risk that AI produces false or misleading recommendations (so-called "hallucinations"), Zeltzer explained that the study used a different class of AI than general-purpose language models such as ChatGPT.
"Those models were trained on internet text and built to generate plausible-sounding responses, not to assess probabilities or medical accuracy," he said. "In contrast, this AI system is trained on actual medical data and designed specifically to calculate diagnostic probabilities. If it is not confident, it makes no recommendation."
The system used in this study issued recommendations in 80% of cases and withheld them in the remaining 20%. It also aligns its proposals with established medical guidelines to increase reliability in high-stakes clinical settings.
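One way to picture the guideline-alignment step is as a simple post-filter on a proposed treatment. The rule table and function below are assumptions made for illustration; the article does not describe how the system actually encodes guidelines.

```python
# Hypothetical guideline filter: reject suggestions that conflict with simple rules.
GUIDELINE_RULES = {
    # diagnosis -> treatments considered inconsistent with guidelines (illustrative)
    "viral upper respiratory infection": {"antibiotics"},
}

def passes_guidelines(diagnosis: str, treatment: str) -> bool:
    """Return False when a proposed treatment contradicts a known guideline rule."""
    return treatment not in GUIDELINE_RULES.get(diagnosis, set())

# Example: an antibiotic suggestion for a viral infection would be filtered out.
assert passes_guidelines("viral upper respiratory infection", "antibiotics") is False
```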
Why test this now?
"This virtual clinic gave us an unusual opportunity to evaluate AI under real-world conditions," Zeltzer said. "Much AI research relies on exam questions and textbook cases, but real patients are messier. Symptoms are not always clearly described, and that's the real challenge."
Zeltzer is cautiously optimistic about what this means for the future. "We can't generalize these findings to all medical conditions, but in many cases, even in very good hospitals, the algorithm gave more accurate advice than the average doctor," he said. "That suggests real potential for improving care and saving time."
Technical limitations prevented the researchers from determining whether doctors saw or used the AI's suggestions, so the study did not measure how the AI influenced physician decisions. Follow-up research is underway.
The results show that AI can reach high levels of accuracy and has practical applications in medicine, Zeltzer added.
Still, many questions remain. How should doctors and AI cooperate? When should recommendations be displayed? Should algorithms ever make decisions autonomously? What safeguards should be in place?
"The pace of innovation is fast, but responsible implementation takes time," Zeltzer said. "We may face new challenges as we go. But it's not hard to imagine a future where algorithms help flag key information, support decisions, and reduce errors in medicine."