ESCRS - FP10.12 - Refractive Surgery and Artificial Intelligence: How Reliable Are ChatGPT's Decisions?

Published 2025 - 43rd Congress of the ESCRS

Reference: FP10.12 | Type: Free paper | DOI: 10.82333/fmrk-s981

Authors: Otman Sandali*1, Rachid Tahiri Joutei Hassani2, Vincent Gualino3, Isabelle Audo4, Vincent Borderie4

1 Guillaume de Varye Clinic, Bourges, France; 2 Granville Hospital, Granville, France; 3 Clinique du Dr Cave, Montauban, France; 4 Quinze-Vingts Hospital, Paris, France

Purpose

To compare the surgical recommendations of refractive surgery (RS) specialists and artificial intelligence (AI, ChatGPT-4o) for patients evaluated for RS procedures, including photorefractive keratectomy (PRK), laser in situ keratomileusis (LASIK), and keratorefractive lenticule extraction (KLEx).

Setting

Beyoglu Eye Training and Research Hospital

Methods

This retrospective study included 44 eyes of 22 patients who were deemed either eligible or ineligible for PRK, LASIK, or KLEx by RS specialists. Collected data included demographic information, uncorrected and corrected distance visual acuity, cycloplegic autorefraction values, Sirius (CSO, Italy) corneal topography reports, scotopic, mesopic, and photopic pupil diameters, axial lengths, and endothelial cell counts. Examination findings and corneal topography data were analyzed by the ChatGPT-4o model, and its surgical recommendations were recorded. Agreement between ChatGPT-4o and the RS specialists, along with patient safety and clinical accuracy, was assessed; agreement was quantified with Cohen's kappa (κ).
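For reference, Cohen's kappa can be computed directly from the two raters' per-eye labels. The sketch below is a minimal, self-contained Python implementation of the statistic; the label lists are hypothetical illustrations and do not reproduce the study's data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the agreement expected by chance from the
    raters' marginal label frequencies.
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed proportion of items on which the raters agree
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical per-eye recommendations (NOT the study's data):
specialist = ["LASIK", "PRK", "none", "LASIK", "KLEx", "none", "PRK", "LASIK"]
chatgpt    = ["LASIK", "LASIK", "LASIK", "LASIK", "KLEx", "PRK", "PRK", "LASIK"]
print(round(cohens_kappa(specialist, chatgpt), 3))  # prints 0.455
```

Values near 0 indicate agreement little better than chance, which is why the κ = 0.194 reported below is interpreted as poor agreement despite being statistically distinguishable from zero.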

Results

The mean patient age was 27.2 ± 6.4 years (12 female, 10 male). Eight eyes were hyperopic, and 36 were myopic. Regarding surgical indications, RS specialists recommended surgery for 63.6% of eyes, whereas ChatGPT-4o recommended surgery for 93.2%. When the suggested surgical methods were compared, 15 eyes received identical recommendations from both evaluators, whereas 12 eyes received different recommendations from ChatGPT-4o and the specialists. Notably, ChatGPT-4o suggested surgery for 14 eyes that the specialists had deemed unsuitable. The kappa coefficient was 0.194, indicating poor agreement between the AI model and the specialists. The standard error of the kappa value was 0.087, and the T-value was 2.513 (p = 0.01).

Conclusions

This study shows that ChatGPT-4o exhibits a low level of agreement with RS specialists (κ = 0.194, p = 0.01). The AI model tended to recommend surgery more frequently than the specialists and suggested surgical procedures for cases they had deemed unsuitable. This raises concerns about patient safety and the clinical reliability of AI-based decision-making. While AI-assisted systems hold potential in RS evaluations, they should be used only under specialist supervision to ensure safe and effective clinical decision-making.