Hand Surg Rehabil. 2025 Mar 11:102122. doi: 10.1016/j.hansur.2025.102122. Online ahead of print.
ABSTRACT
INTRODUCTION: The American Academy of Orthopaedic Surgeons (AAOS) developed appropriate use criteria (AUC) to guide treatment decisions for distal radius fractures based on expert consensus. This study aims to evaluate the accuracy of Chat Generative Pre-trained Transformer-4o (ChatGPT-4o) by comparing its appropriateness scores for distal radius fracture treatment with those from the AUC.
METHODS: The AUC patient scenarios were categorized by factors such as fracture type (AO/OTA classification), mechanism of injury, pre-injury activity level, patient health status (ASA class 1-4), and associated injuries. Treatment options included percutaneous pinning, spanning external fixation, volar locking plates, dorsal plates, and immobilization methods, among others. Orthopedic surgeons assigned appropriateness scores for each treatment (1-3 = "Rarely Appropriate," 4-6 = "May Be Appropriate," and 7-9 = "Appropriate"). ChatGPT-4o was prompted with the same patient scenarios and asked to assign scores. Differences between the AAOS and ChatGPT-4o ratings were used to calculate the mean error, mean absolute error, and mean squared error. Statistical significance was assessed using Spearman correlation, and appropriateness scores were grouped into categories to determine the percentage overlap between the two sources.
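As an illustration of the comparison described in the Methods, the Python sketch below computes a mean error, mean absolute error, mean squared error, rank correlation, and category-overlap percentage for a set of paired appropriateness scores. The function names, array names, and example values are hypothetical; this is not the authors' analysis code.

    # Minimal sketch of the score comparison, assuming paired 1-9 appropriateness
    # ratings are available as two equal-length arrays (illustrative values only).
    import numpy as np
    from scipy.stats import spearmanr

    def categorize(score):
        """Map a 1-9 appropriateness score to its AUC category."""
        if score <= 3:
            return "Rarely Appropriate"
        if score <= 6:
            return "May Be Appropriate"
        return "Appropriate"

    def compare_scores(aaos, gpt):
        """Compute agreement metrics between AAOS and ChatGPT-4o ratings."""
        aaos = np.asarray(aaos, dtype=float)
        gpt = np.asarray(gpt, dtype=float)
        diff = gpt - aaos
        me = diff.mean()                  # mean error (signed bias)
        mae = np.abs(diff).mean()         # mean absolute error
        mse = (diff ** 2).mean()          # mean squared error
        rho, p = spearmanr(aaos, gpt)     # rank correlation and p-value
        # Percentage of paired scores falling in the same appropriateness category
        overlap = np.mean([categorize(a) == categorize(g)
                           for a, g in zip(aaos, gpt)]) * 100
        return {"ME": me, "MAE": mae, "MSE": mse,
                "Spearman rho": rho, "p": p, "category overlap %": overlap}

    # Illustrative usage with made-up scores for one treatment option
    print(compare_scores(aaos=[7, 8, 6, 9, 5], gpt=[6, 7, 4, 8, 5]))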
RESULTS: A total of 240 patient scenarios and 2160 paired treatment scores were analyzed. The mean error across treatment options ranged from 0.6 for the volar locking plate to -2.9 for dorsal plating. Pearson correlation revealed significant positive associations for the dorsal spanning bridge (0.43, P < 0.001) and spanning external fixation (0.4, P < 0.001). The percentage overlap between AAOS and ChatGPT-4o appropriateness categories varied, with 99.17% agreement for immobilization without reduction, 90.42% for volar locking plates, and only 15% for dorsal plating.
CONCLUSION: ChatGPT-4o does not consistently align with the AAOS appropriate use criteria when rating the management of distal radius fractures. While there was moderate concordance for certain treatments, ChatGPT-4o tended to favor more conservative approaches, raising concerns about the reliability of AI-generated recommendations for medical advice and clinical decision-making.
PMID:40081807 | DOI:10.1016/j.hansur.2025.102122