The Application of Generative Artificial Intelligence in the Assessment of Standardized Residency Training in Critical Care Medicine

Abstract: Objective To explore the effectiveness of generative artificial intelligence (GAI) in the assessment of standardized residency training in critical care medicine. Methods The study enrolled residents undergoing standardized training in the critical care medicine departments of Peking Union Medical College Hospital and Beijing Friendship Hospital from June to September 2024, together with teaching physicians qualified as standardized training instructors. Two sets of GAI-generated examination papers (produced with Tongyi Qianwen 2.5) and one set of human-generated papers were administered to all residents. The answers were graded independently by teaching physicians and by Tongyi Qianwen 2.5. The human and GAI grading results were compared, and feedback from residents and teaching physicians on the GAI-generated and human-generated papers was collected. Results A total of 35 residents and 11 teaching physicians were included in the study. Residents scored significantly higher on single-choice questions from the two GAI-generated papers than from the human-generated paper (both P<0.05), but significantly lower on multiple-choice questions (both P<0.05). GAI and human grading of short-answer questions did not differ significantly for any of the three papers (all P>0.05). In subjective evaluations, both teaching physicians (P=0.007) and residents (P=0.008) rated the GAI-generated papers as less difficult, whereas content accuracy and alignment with the training syllabus did not differ significantly between GAI-generated and human-generated papers (all P>0.05). Conclusions GAI is comparable to human effort in generating and grading examination papers, but question difficulty requires further optimization. GAI holds promise as a valuable tool for improving the efficiency of resident teaching assessment.
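The abstract reports paired comparisons of GAI and human grading with significance tests but does not specify the statistical software or code used. Below is a minimal Python sketch, not taken from the study, of how such a paired comparison could be run; the score arrays, random seed, and the choice of the Wilcoxon signed-rank test are all assumptions for illustration.

```python
# Minimal sketch (not from the study) of a paired comparison between
# GAI grading and human grading of short-answer scores.
# All data below are hypothetical placeholders.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Hypothetical short-answer scores for 35 residents on one paper,
# graded independently by teaching physicians and by the GAI model.
human_graded = rng.normal(loc=80, scale=6, size=35).round(1)
gai_graded = (human_graded + rng.normal(loc=0, scale=2, size=35)).round(1)

# Paired, non-parametric comparison of the two grading methods
# (Wilcoxon signed-rank test); P > 0.05 would indicate no
# statistically significant difference, as the abstract reports.
stat, p_value = wilcoxon(human_graded, gai_graded)
print(f"Wilcoxon statistic = {stat:.1f}, P = {p_value:.3f}")
```

A non-parametric paired test is a common choice here because grading scores for the same residents are compared across methods and may not be normally distributed; the study itself may have used a different test.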
