Evaluating the Performance of ChatGPT4.0 Versus ChatGPT3.5 on the Hand Surgery Self-Assessment Exam: A Comparative Analysis of Performance on Image-Based Questions

Written on 17/02/2025
by Kiera L Vrindten

Cureus. 2025 Jan 16;17(1):e77550. doi: 10.7759/cureus.77550. eCollection 2025 Jan.

ABSTRACT

Hypothesis: ChatGPT, an artificial intelligence (AI) platform, has become an increasingly useful tool in medical education, particularly as a supplement to residents' preparation for certification exams. As the model continues to progress, there is a growing need to establish ChatGPT's accuracy in specialty-specific knowledge. Our study assesses the performance of ChatGPT4.0 on hand surgery self-assessment questions in comparison with its predecessor, ChatGPT3.5. A distinguishing feature of ChatGPT4.0 is its ability to interpret visual input, a capability that ChatGPT3.5 lacks. We hypothesize that ChatGPT4.0 will perform better on image-based questions than ChatGPT3.5.

Methods: This study used 10 self-assessment exams (2004 to 2013) from the American Society for Surgery of the Hand (ASSH). Performance on image-based questions was compared between ChatGPT4.0 and ChatGPT3.5. The primary outcome was the total score, expressed as the proportion of correct answers. Secondary outcomes were the proportion of questions for which ChatGPT4.0 provided elaborations, the length of those elaborations, and the number of questions for which ChatGPT4.0 answered with confidence. Descriptive statistics, Student's t-test, and one-way ANOVA were used for data analysis.

Results: Across 455 image-based questions, there was no statistically significant difference in total score between ChatGPT4.0 and ChatGPT3.5: ChatGPT4.0 answered 137 (30.1%) questions correctly, while ChatGPT3.5 answered 131 (28.7%) correctly (p = 0.805). Although there was no significant difference between the two versions in the length or frequency of elaborations relative to the proportion of correct answers, ChatGPT4.0 did provide significantly longer explanations overall than ChatGPT3.5 (p < 0.05). Moreover, of the 455 image-based questions, ChatGPT4.0 provided significantly fewer confident answers than ChatGPT3.5 (p < 0.05). Among the responses in which ChatGPT4.0 expressed uncertainty, there was a significant difference by image type, with the greatest uncertainty arising from question stems involving radiograph-based images (p < 0.001).

Summary points: Overall, there was no significant difference in performance between ChatGPT4.0 and ChatGPT3.5 when answering image-based questions on the ASSH self-assessment examinations. Notably, however, ChatGPT4.0 expressed more uncertainty in its answers. Further exploration of how AI-generated responses influence user behavior in clinical and educational settings will be crucial to optimizing the role of AI in healthcare.
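For readers who want a concrete sense of the statistical comparisons named in the abstract, the following is a minimal Python sketch. It is not the authors' analysis code: only the correct-answer counts (137/455 and 131/455) come from the abstract, the chi-square test on proportions is one standard choice that may differ from the test the authors used, and all elaboration lengths, uncertainty scores, and image-type groupings other than "radiograph" are hypothetical placeholders.

```python
# Illustrative sketch of the kinds of tests described in the abstract.
# Only the correct-answer counts (137/455 vs. 131/455) are taken from the
# abstract; every other number below is a hypothetical placeholder.
import numpy as np
from scipy import stats

# Primary outcome: proportion of correct answers, compared with a
# chi-square test on a 2x2 contingency table (correct vs. incorrect).
table = np.array([
    [137, 455 - 137],   # ChatGPT4.0: correct, incorrect
    [131, 455 - 131],   # ChatGPT3.5: correct, incorrect
])
chi2, p_prop, dof, _ = stats.chi2_contingency(table)
print(f"Proportion comparison: chi2={chi2:.3f}, p={p_prop:.3f}")

# Secondary outcome: elaboration length, compared with Student's t-test
# on hypothetical per-question word counts.
rng = np.random.default_rng(0)
len_gpt4 = rng.normal(180, 40, 455)    # hypothetical word counts
len_gpt35 = rng.normal(150, 40, 455)   # hypothetical word counts
t, p_len = stats.ttest_ind(len_gpt4, len_gpt35)
print(f"Elaboration length: t={t:.2f}, p={p_len:.4f}")

# Uncertainty by image type, compared with one-way ANOVA. "Radiograph" is
# mentioned in the abstract; the other categories and all scores are hypothetical.
radiograph = rng.normal(0.60, 0.10, 100)
clinical_photo = rng.normal(0.40, 0.10, 100)
illustration = rng.normal(0.35, 0.10, 100)
f_stat, p_anova = stats.f_oneway(radiograph, clinical_photo, illustration)
print(f"Uncertainty by image type: F={f_stat:.2f}, p={p_anova:.4f}")
```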

PMID:39958041 | PMC:PMC11829751 | DOI:10.7759/cureus.77550