Enhancing Disease Diagnosis - Can Artificial Intelligence Help?

Published 2024 - 42nd Congress of the ESCRS

Reference: PP18.04 | Type: Free paper | DOI: 10.82333/mnfh-b235

Authors: Hanna Popiela*¹, Jacob Adams¹, Antony Wilby², Ramez Borbara³, Magdalena Z Popiela²

¹Wroclaw Medical University, Wroclaw, Poland; ²Ophthalmology, University Hospital of Wales, Cardiff, United Kingdom; ³Ophthalmology, Hillel Yaffe Medical Centre, Hadera, Israel

Purpose

The aim was to establish whether artificial intelligence Large Language Models (LLMs) can accurately state a diagnosis when given descriptive patient data from an open-source ophthalmic patient database on the University of Iowa’s website. Additionally, we tested whether there is a significant difference between AI systems: OpenAI's ChatGPT compared to Google's Gemini LLM.

Setting

The study was carried out in an online setting using an open-source ophthalmic database from the University of Iowa website. The authors are affiliated with Wrocław Medical University, Poland; University Hospital of Wales, Cardiff; and the Ophthalmology department, Hillel Yaffe Medical Center, Hadera, Israel.

Methods

We compared the diagnostic accuracy of 4 Large Language Models (LLMs): ChatGPT 3.5, ChatGPT 4, Gemini and Gemini Advanced, by entering the same prompt with patient information into each. Patient information was obtained from the University of Iowa's open database, EyeRounds.org. We asked the LLMs for the main diagnosis, differential diagnoses and the proposed treatment, which we then compared with the real-life patient information provided in the University of Iowa's database. 20 cases were selected at random; all related to anterior segment diseases, such as Acanthamoeba keratitis, viral conjunctivitis and acid burns. We calculated the percentage of cases each LLM diagnosed correctly and compared the results.

Results

All LLMs stated the correct main diagnosis in at least 60% of cases. ChatGPT 3.5 and ChatGPT 4 had the lowest accuracy rates, each reaching only 60%. Gemini stated the correct diagnosis in 71.4% of cases and Gemini Advanced reached 88.9% diagnostic accuracy. In 6 of the 20 clinical cases Gemini answered, “I am not a medical doctor. I'm an AI language model and cannot provide a diagnosis you should act on.” Gemini Advanced gave this kind of answer in 2 of the 20 clinical cases. Those cases were excluded from the accuracy calculations. ChatGPT did not give such a reply. The differential diagnoses and treatments were appropriate for the answers given.
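The effect of excluding the declined cases can be sketched with a short calculation. The abstract reports only percentages, so the counts of correct answers below (12, 12, 10 and 16) are assumptions inferred from those percentages and the stated number of declined cases; the class and field names are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTally:
    """Per-model case counts. The `correct` values are inferred, not reported."""
    name: str
    total_cases: int   # cases presented to the model
    declined: int      # cases where the model declined to give a diagnosis
    correct: int       # ASSUMED count of correct main diagnoses

    @property
    def answered(self) -> int:
        # Cases that remain after declined cases are excluded.
        return self.total_cases - self.declined

    def accuracy_excluding_declined(self) -> float:
        # Accuracy as described in the Results: declined cases are
        # removed from the denominator.
        return 100.0 * self.correct / self.answered

    def accuracy_over_all_cases(self) -> float:
        # Alternative view for comparison only: declined cases stay in
        # the denominator and count as not diagnosed.
        return 100.0 * self.correct / self.total_cases


# Counts of correct answers are assumptions chosen to match the
# reported percentages (60%, 60%, 71.4%, 88.9%).
tallies = [
    ModelTally("ChatGPT 3.5", total_cases=20, declined=0, correct=12),
    ModelTally("ChatGPT 4", total_cases=20, declined=0, correct=12),
    ModelTally("Gemini", total_cases=20, declined=6, correct=10),
    ModelTally("Gemini Advanced", total_cases=20, declined=2, correct=16),
]

for t in tallies:
    print(
        f"{t.name}: {t.correct}/{t.answered} answered cases correct "
        f"({t.accuracy_excluding_declined():.1f}%), "
        f"{t.accuracy_over_all_cases():.1f}% over all {t.total_cases} cases"
    )
```

Under these assumed counts, the denominators differ between models (20, 20, 14 and 18 cases), which is worth keeping in mind when comparing the reported accuracy rates.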

Conclusions

Use of Large Language Models can enhance anterior segment disease diagnosis. However, their performance depends largely on the quality of the information fed into the system. The newly released Gemini Advanced had the highest diagnostic accuracy in our cohort and, additionally, recognised its diagnostic limitations. When prompted, it offered a comprehensive list of differential diagnoses and treatment options appropriate for the main diagnosis stated in all cases.