Evaluating the Accuracy, Readability, and Reliability of AI-Generated Patient Information Leaflets for Descemet Membrane Endothelial Keratoplasty (DMEK)
Published 2025 - 43rd Congress of the ESCRS
Reference: PO524 | Type: Free paper | DOI: 10.82333/9znd-t006
Authors: Alexander Franchi 1, Christoph Palme* 1, Victoria Stöckl 1, Nadja Franz 1, Barnabas Kremser 1, Paolo Bonatti 1, Bernhard Steger 1
1 Augenheilkunde und Optometrie, Medizinische Universität Innsbruck, Innsbruck, Austria
Purpose
This study assessed the accuracy, readability, and reliability of patient information leaflets on Descemet Membrane Endothelial Keratoplasty (DMEK) generated by eight AI large language models (LLMs) (ChatGPT-4, ChatGPT-4o, Microsoft Copilot, DeepSeek-V3, Google Gemini 2.0 Flash, Claude 3.7 Sonnet, Perplexity AI). The aim was to determine which LLM produced the most patient-friendly, comprehensible, and evidence-based leaflet. Readability metrics, health literacy metrics, and misinformation detection provided an objective comparison of the LLMs against a comparator leaflet (Royal Free London NHS Foundation Trust).
Setting
Not Applicable
Methods
Each leaflet was generated using the prompt 'Make a patient information leaflet on Descemet Membrane Endothelial Keratoplasty (DMEK) surgery'. Readability metrics (Flesch-Kincaid Grade Level (FKG), Flesch Reading Ease (FRE), Automated Readability Index (ARI), Gunning Fog (GF)), reliability metrics (DISCERN, PEMAT), misinformation detection, and reference analysis were recorded for each response. Word count was used to gauge comprehensiveness. Comparative analysis identified variations in readability, reliability, and accuracy. A weighted scoring system was then developed, normalising scores onto a 0-100% scale.
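The Flesch formulas named above are standard and publicly documented. A minimal sketch of how such metrics and a weighted, normalised composite score might be computed follows; the specific weights, metric ranges, and function names here are illustrative assumptions, not the study's actual scoring scheme.

```python
# Illustrative sketch: standard Flesch readability formulas plus a simple
# min-max normalisation and weighted mean, approximating the kind of
# 0-100% weighted scoring described in the Methods. Weights and ranges
# below are assumptions for demonstration only.

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease (FRE): higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade required."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def normalise(value: float, lo: float, hi: float) -> float:
    """Min-max normalise a raw metric onto a 0-100 scale."""
    return 100.0 * (value - lo) / (hi - lo)

def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of already-normalised (0-100) metric scores."""
    total = sum(weights.values())
    return sum(scores[k] * weights[k] for k in weights) / total

# Example: a hypothetical leaflet of 1000 words, 60 sentences, 1500 syllables.
fre = flesch_reading_ease(1000, 60, 1500)
overall = weighted_score(
    {"fre": normalise(fre, 0, 100), "discern": normalise(60, 16, 80)},
    {"fre": 0.5, "discern": 0.5},  # assumed equal weighting
)
```

A plain-language leaflet typically targets an FRE above ~60 (roughly grade 8 or below), which is why the readability metrics carry weight alongside the reliability instruments.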
Results
The comparator leaflet achieved the highest overall score (95%), as expected. Among the LLM leaflets, Claude 3.7 Sonnet scored highest overall (90%). ChatGPT-4o scored highest in readability (FRE = 50, PEMAT 100%) but lacked references; Claude 3.7 Sonnet and Perplexity AI provided the most references (2+). Microsoft Copilot had the lowest DISCERN score (49) and word count (255 words), making it the least comprehensive LLM leaflet. Misinformation was also flagged in the Microsoft Copilot and Perplexity AI leaflets.
Conclusions
No single LLM excelled in all metrics, indicating variability in readability, reliability, and misinformation risk. Some models (ChatGPT-4o and Claude 3.7 Sonnet) provided clear and comprehensible material, whilst others (Perplexity AI and Microsoft Copilot) lacked accuracy and readability. Concerns remain regarding LLMs' reliability in providing accurate and up-to-date information, owing to the lack of standardised medical sources. Generated content must therefore continue to be validated manually against established healthcare sources (NHS, AAO) to ensure accuracy and completeness. However, successive model generations show evidence of improvement, which is promising for the future generation of patient education materials.