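A Comparative Study of Artificial Intelligence-based Classification Versus Manual Classification of Medical Adverse Events: Taking the DeepSeek Large Language Model As an Example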

王瑞; 谭旭彤; 赵从朴; 王书畅; 陈政; 马小军; 蔡志玲

doi:10.12290/xhyxzz.2025-0371

A Comparative Study of Artificial Intelligence-based Classification Versus Manual Classification of Medical Adverse Events: Taking the DeepSeek Large Language Model As an Example

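Abstract:
Objective To analyze the value of artificial intelligence (AI) in the classification of medical adverse events.
Methods Medical adverse events reported through the adverse event reporting system of Peking Union Medical College Hospital between September 1, 2023, and August 31, 2024, were retrospectively collected as the study subjects. Eligible events were de-identified and then classified both manually, in the conventional way, and by an AI large language model (the full-capacity web version of DeepSeek-R1). The time each method took was recorded, and the consistency and differences of the classification results were compared. With manual classification as the gold standard, the accuracy of AI classification was comprehensively evaluated.
Results A total of 273 medical adverse events were included in the analysis. Manual classification took 38 838 s in total, an average of 14.22 s per event; AI classification took 600 s in total, an average of 2.19 s per event. The two methods agreed on 202 events and disagreed on 71, for an overall agreement rate of 73.99% and a Kappa coefficient of 0.646 (95% CI: 0.575~0.717), with a standard error of 0.0362. With manual classification as the gold standard, the accuracy of AI classification ranged from 80% to 99%, precision from 30% to 100%, recall from 40% to 100%, F1 score from 0.46 to 0.79, and specificity from 46% to 98%. AI classification showed relatively balanced and strong overall performance across these metrics for device-related and drug-related adverse events.
Conclusion The DeepSeek large language model can help improve the efficiency of medical adverse event classification and shows good application potential, particularly for device-related and drug-related adverse events.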

Abstract:
Objective To analyze the application value of artificial intelligence (AI)-based classification in the categorization of medical adverse events.
Methods Medical adverse events reported to the Adverse Event Reporting System of Peking Union Medical College Hospital from September 1, 2023, to August 31, 2024, were retrospectively collected as the study subjects. After de-identification of adverse events meeting the inclusion criteria, conventional manual classification and AI-based classification using a large language model (the full-capacity web version of DeepSeek-R1) were performed. The time required for classification with each method was recorded, and the consistency and discrepancies between the two methods were compared. Using manual classification as the gold standard, the accuracy of AI-based classification was comprehensively evaluated.
Results A total of 273 medical adverse events were analyzed. Manual classification took 38 838 seconds in total, with an average of 14.22 seconds per event. AI-based classification took 600 seconds in total, with an average of 2.19 seconds per event. The two methods showed consistent classification in 202 events and inconsistent classification in 71 events, yielding an overall agreement rate of 73.99% and a Kappa coefficient of 0.646 (95% CI: 0.575-0.717), with a standard error of 0.0362. Using manual classification as the gold standard, AI-based classification achieved accuracy ranging from 80% to 100%, precision from 30% to 100%, recall from 40% to 100%, F1 scores from 0.46 to 0.79, and specificity from 46% to 98%. Notably, AI-based classification demonstrated balanced and overall excellent performance in the categorization of device-related and drug-related adverse events.
Conclusion The DeepSeek large language model can assist in improving the efficiency of medical adverse event classification, showing promising application potential, particularly in the categorization of device-related and drug-related adverse events.
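
The agreement statistics reported above can be illustrated with a short Python sketch. It is not the authors' analysis code. The function names (`percent_agreement`, `kappa_ci`, `cohen_kappa`, `one_vs_rest`) and the six toy labels are hypothetical, chosen only for illustration. The interval is assumed to be a normal-approximation 95% CI (Kappa plus or minus 1.96 standard errors), and the per-category metrics are assumed to be computed one-vs-rest with manual classification as the reference.

```python
from collections import Counter


def percent_agreement(n_agree, n_total):
    """Share of events on which the two methods chose the same category."""
    return n_agree / n_total


def kappa_ci(kappa, se, z=1.96):
    """Normal-approximation confidence interval for a Kappa coefficient."""
    return kappa - z * se, kappa + z * se


def cohen_kappa(manual, ai):
    """Cohen's Kappa for two label sequences of equal length."""
    n = len(manual)
    p_o = sum(m == a for m, a in zip(manual, ai)) / n  # observed agreement
    cm, ca = Counter(manual), Counter(ai)
    # Agreement expected by chance, from the marginal label frequencies.
    p_e = sum(cm[c] * ca[c] for c in set(cm) | set(ca)) / n ** 2
    return (p_o - p_e) / (1 - p_e)


def one_vs_rest(manual, ai, label):
    """Per-category metrics, treating the manual labels as the reference."""
    pairs = list(zip(manual, ai))
    tp = sum(m == label and a == label for m, a in pairs)
    fp = sum(m != label and a == label for m, a in pairs)
    fn = sum(m == label and a != label for m, a in pairs)
    tn = len(pairs) - tp - fp - fn

    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as NaN.
        return num / den if den else float("nan")

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),
    }


# Figures taken from the abstract: 202 of 273 events agree,
# Kappa = 0.646, standard error = 0.0362.
print(f"{100 * percent_agreement(202, 273):.2f}%")  # 73.99%
low, high = kappa_ci(0.646, 0.0362)
print(f"{low:.3f} to {high:.3f}")  # 0.575 to 0.717

# Hypothetical toy labels, NOT data from the study.
manual = ["drug", "drug", "device", "device", "other", "other"]
ai = ["drug", "drug", "device", "other", "other", "drug"]
print(cohen_kappa(manual, ai))
print(one_vs_rest(manual, ai, "drug"))
```

Under these assumptions, the reported agreement rate and confidence interval follow directly from the counts, Kappa value, and standard error given in the abstract. The one-vs-rest breakdown shows why a single category can combine high accuracy with low precision or recall when that category is rare.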
