Assessment Of Advanced Large Language Models In Addressing Patient Questions On Corneal Refractive Surgery
Published 2025 - 43rd Congress of the ESCRS
Reference: PO1061 | Type: Poster | DOI: 10.82333/wh05-cz39
Authors: Tsung-Hsien Tsai* 1
1Department of Ophthalmology, Chang Gung Memorial Hospital, Keelung, Taiwan, Province of China
Purpose
This study assesses the performance of six state-of-the-art large language models (LLMs)—ChatGPT-4o, Claude 3.5 Sonnet, Gemini Advanced 1.5 Pro, Tongyi Qwen 2.5, ChatGPT o1, and DeepSeek-R1—in answering common patient inquiries about corneal refractive surgeries, including laser in situ keratomileusis (LASIK), kerato-refractive lenticule extraction, and photorefractive keratectomy (PRK), in both English and Chinese. The objective is to evaluate their accuracy, comprehensiveness, readability, and repeatability to inform equitable, multilingual patient education.
Setting
Online large language models
Methods
Standardized questions sourced from reputable ophthalmology resources were administered to each LLM. Three experienced refractive surgeons independently rated the responses for accuracy and comprehensiveness using a five-point Likert scale. Readability was assessed using validated instruments, while repeatability was evaluated by re-querying the models after a two-week interval. Statistical analyses included post-hoc testing to detect significant differences across models.
Results
ChatGPT o1 and DeepSeek-R1 demonstrated significantly higher accuracy and comprehensiveness in both English and Chinese than other models (Bonferroni-adjusted p < 0.001), with no significant difference between them. For English readability, Claude 3.5 Sonnet produced superior responses based on Flesch-Kincaid Grade Level, Gunning Fog Index, Coleman-Liau Index, SMOG Index, and Flesch Reading Ease (Bonferroni-adjusted p < 0.05). Conversely, ChatGPT-4o and Tongyi Qwen 2.5 had lower readability scores. For Chinese readability, ChatGPT-4o outperformed other models (Bonferroni-adjusted p < 0.05). Repeatability assessments showed no significant differences across the six models over time.
Conclusions
These findings highlight the potential of advanced LLMs to enhance patient education in corneal refractive surgery across multiple languages. While newer models such as ChatGPT o1 and DeepSeek-R1 excel in delivering accurate and comprehensive content, Claude 3.5 Sonnet and ChatGPT-4o provide more readable responses in English and Chinese, respectively. The variability observed across models underscores the need for model-specific optimization to ensure accurate, comprehensive, and linguistically appropriate patient communication.