Is The Use Of A General Large Language Model Helpful To Ophthalmology Residents On Call?

Published 2025 - 43rd Congress of the ESCRS

Reference: FP10.13 | Type: Free paper | DOI: 10.82333/d30m-h488

Authors: Meiyan Li*¹, Chi Zhang¹, Xingtao Zhou¹

¹Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Shanghai, China

Purpose

This study aims to evaluate the diagnostic performance and management accuracy of a large language model (LLM), ChatGPT-4, for acute cases referred to ophthalmology in an Emergency Department (ED), and the model's utility to ophthalmology residents in this setting.

Setting

This study was conducted in the Emergency Department (ED) of a tertiary hospital in Singapore, evaluating the performance of ChatGPT-4 in assisting first- and second-year ophthalmology residents with acute ophthalmological cases referred to them in the ED.

Methods

Over the first four weeks of residency, clinical vignettes of patients seen by first- and second-year ophthalmology residents were presented to ChatGPT-4. The model's diagnostic accuracy, top three differential diagnoses, and management plans were compared against those of the first- and second-year residents, with the ground truth established by responses from senior residents who had completed at least three years of residency training and obtained the Master of Medicine in Ophthalmology [MMed (Ophth)]. The residents and ChatGPT-4 were blinded to each other's responses.
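The abstract does not state how the vignettes were submitted to the model (for example, the ChatGPT web interface versus the API). As a purely illustrative sketch of one possible setup, the Python snippet below poses a vignette through the OpenAI chat completions API; the prompt wording, the `model` string, and the response structure requested are assumptions, not the authors' protocol.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt; the authors' actual instructions are not reported.
SYSTEM_PROMPT = (
    "You are assisting with an acute ophthalmology referral in an emergency "
    "department. For the vignette provided, state (1) the most likely "
    "diagnosis, (2) your top three differential diagnoses, and (3) an "
    "initial management plan."
)

def query_vignette(vignette_text: str) -> str:
    """Send one clinical vignette to the model and return its free-text answer."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model identifier
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": vignette_text},
        ],
        temperature=0,  # deterministic output for repeatable grading
    )
    return response.choices[0].message.content

print(query_vignette("65-year-old with sudden painless monocular vision loss..."))
```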

Results

A total of 129 clinical vignettes were presented, consisting of 38 general ophthalmology cases, 37 corneal cases, 33 vitreoretinal cases, 7 oculoplastic cases, 10 neuro-ophthalmology cases, and 4 glaucoma cases. ChatGPT-4 attained a diagnostic accuracy of 78.4%, outperformed by the residents' combined diagnostic accuracy of 94.6% (p<0.001). ChatGPT-4 included the correct diagnosis within its top three differentials in 93.0% of cases, matching the residents' 98.4% (p=0.159), and on average generated 57.6% of the correct top three differentials, comparable to the residents' average of 58.0% (p=0.546). ChatGPT-4 obtained a mean management score of 67.8%, while residents achieved a significantly higher mean management score of 76.9% (p<0.001).
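The abstract does not name the statistical tests behind the reported p-values. For paired binary outcomes on the same 129 vignettes (correct versus incorrect diagnosis from ChatGPT-4 and from the residents), McNemar's test is one standard choice; the sketch below is illustrative only, with counts invented to be consistent with the reported marginal accuracies (101/129 ≈ 78.4% and 122/129 ≈ 94.6%) rather than taken from the study.

```python
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical 2x2 table of paired outcomes over the same vignettes
# (invented counts, not study data):
#                      resident correct   resident wrong
# ChatGPT-4 correct           98                 3
# ChatGPT-4 wrong             24                 4
table = [[98, 3],
         [24, 4]]

# Exact McNemar test, which compares the discordant pairs (3 vs 24).
result = mcnemar(table, exact=True)
print(f"statistic={result.statistic:.0f}, p={result.pvalue:.4g}")
```

With these illustrative counts the discordant pairs are heavily one-sided, yielding p<0.001, in line with the significance level the abstract reports for the diagnostic-accuracy comparison.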

Conclusions

ChatGPT-4 shows reasonable diagnostic accuracy and performance comparable to that of ophthalmology residents in generating differential diagnoses for acute ophthalmological presentations, though its management plans scored significantly lower. Further development of LLMs with medical datasets may improve their synthesis of management plans and increase their utility for residents-in-training.