Evaluating The Use Of Large Language Models As Decision-Support Tools For Determining Suitability For Toric Intraocular Lens Implantation

Published 2025 - 43rd Congress of the ESCRS

Reference: FP13.13 | Type: Free paper | DOI: 10.82333/j2p0-dg62

Authors: Inder Paul Singh* ¹ , Belen Lopez Murzi ²

¹The Eye Centers of Racine and Kenosha, Racine, Wisconsin, United States; ²Panama Eye Center, Panama, Panama

Purpose

Large language models (LLMs) are increasingly used in medicine. We evaluated the effectiveness of LLM chatbots, including GPT-3.5 and GPT-4 (OpenAI), Gemini (Google), and Copilot (Microsoft), as clinical decision-support tools in cataract surgery for astigmatic patients who are candidates for Toric intraocular lens (IOL) implantation.

Setting

The study was a retrospective review of patients seen at the outpatient cataract clinics of a tertiary hospital.

Methods

A total of 116 eyes of patients undergoing routine cataract surgery evaluations were reviewed. Patients with astigmatism > 1.0 D detected by the Tomey OA-2000 optical biometer underwent corneal tomography with the Oculus Pentacam HR. Outputs were anonymized and independently evaluated by two cornea specialists for Toric IOL suitability, with discrepancies resolved by a third cornea specialist. The same preoperative measurements were analyzed by GPT-3.5, GPT-4, Gemini, and Copilot, and their recommendations for Toric IOL suitability were documented. Outcome measures were each chatbot's accuracy relative to the specialists, area under the ROC curve (AUC), sensitivity, specificity, and positive and negative predictive values (PPV and NPV, respectively).
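The outcome measures listed above can all be derived from a 2x2 comparison of each chatbot's recommendation against the specialist consensus. The following is a minimal illustrative sketch, not the authors' analysis code. The function names and the boolean encoding (True = suitable for Toric IOL) are assumptions made for this example. The single-point AUC, (sensitivity + specificity) / 2, is what the trapezoidal ROC area reduces to when the prediction is a yes/no label; the abstract does not state how AUC was actually computed.

```python
import math
from typing import Dict, Sequence


def safe_ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def diagnostic_metrics(reference: Sequence[bool],
                       predicted: Sequence[bool]) -> Dict[str, float]:
    """Compare binary predictions against a binary reference standard.

    reference: specialist consensus per eye (True = suitable); hypothetical encoding.
    predicted: chatbot recommendation per eye (True = suitable); hypothetical encoding.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    tp = fp = tn = fn = 0
    for ref, pred in zip(reference, predicted):
        if ref and pred:
            tp += 1
        elif ref and not pred:
            fn += 1
        elif not ref and pred:
            fp += 1
        else:
            tn += 1

    sensitivity = safe_ratio(tp, tp + fn)
    specificity = safe_ratio(tn, tn + fp)
    return {
        "accuracy": safe_ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": safe_ratio(tp, tp + fp),
        "npv": safe_ratio(tn, tn + fn),
        # Trapezoidal ROC area for a single operating point (binary label).
        "auc_single_point": (sensitivity + specificity) / 2,
    }


# Toy data for illustration only; these are not study data.
toy_reference = [True, True, True, False, False]
toy_predicted = [True, False, True, False, True]
toy_metrics = diagnostic_metrics(toy_reference, toy_predicted)
```

In the toy example there are 2 true positives, 1 false negative, 1 true negative, and 1 false positive, giving an accuracy of 0.6 and a sensitivity of 2/3.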

Results

Of the 116 eyes, 81 (70%) were found suitable for Toric IOL by our ophthalmologists. GPT-4 demonstrated an accuracy of 60%, with an AUC of 0.62, a PPV of 80%, an NPV of 40%, a sensitivity of 58%, and a specificity of 66%. In comparison, GPT-3.5 achieved an accuracy of 56%, an AUC of 0.62, a PPV of 83%, an NPV of 39%, a sensitivity of 47%, and a specificity of 77% (P<0.05). Gemini and Copilot both performed worse, with AUC values of 0.54 and 0.53, respectively (P>0.05), and none of the other test measures for these chatbots reached statistical significance (P>0.05). Additionally, Copilot had difficulty accurately interpreting biometry and tomography values.
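The reported predictive values depend on the proportion of suitable eyes (81 of 116) as well as on sensitivity and specificity. As a hedged illustration, the sketch below applies Bayes' rule to the rounded sensitivity and specificity quoted above. The function name is hypothetical, and this is a consistency check on the published figures rather than a reproduction of the authors' analysis.

```python
from typing import Tuple


def predictive_values(sensitivity: float,
                      specificity: float,
                      prevalence: float) -> Tuple[float, float]:
    """Return (PPV, NPV) from sensitivity, specificity, and prevalence via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Proportion of eyes judged suitable by the specialists, as reported.
prevalence = 81 / 116

# Rounded sensitivity and specificity as quoted in the abstract.
gpt4_ppv, gpt4_npv = predictive_values(0.58, 0.66, prevalence)
gpt35_ppv, gpt35_npv = predictive_values(0.47, 0.77, prevalence)

print(f"GPT-4:   PPV {gpt4_ppv:.2f}, NPV {gpt4_npv:.2f}")    # PPV 0.80, NPV 0.40
print(f"GPT-3.5: PPV {gpt35_ppv:.2f}, NPV {gpt35_npv:.2f}")  # PPV 0.83, NPV 0.39
```

The recomputed values agree with the reported PPV and NPV. With about 70% of eyes suitable, a modest sensitivity produces many false negatives relative to true negatives, which is why NPV stays near 40% for both models.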

Conclusions

This pioneering study found GPT-4 and GPT-3.5 to have fair performance in assessing Toric IOL suitability, while Gemini and Copilot underperformed. As LLM capabilities continue to improve, these models hold significant potential to enhance clinical care as decision-support tools for cataract surgeons.