MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models
Authors: Yan Cai, Linlin Wang, Ye Wang, Gerard de Melo, Ya Zhang, Yanfeng Wang, Liang He
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings: (1) Chinese medical LLMs underperform on this benchmark, highlighting the need for significant advances in clinical knowledge and diagnostic precision. (2) Several general-domain LLMs surprisingly possess considerable medical knowledge. We perform extensive experiments and offer detailed analyses to provide diverse perspectives for evaluating clinical knowledge recall and reasoning capabilities of LLMs across a range of branches of medicine. |
| Researcher Affiliation | Academia | Yan Cai (1), Linlin Wang (1,2)*, Ye Wang (1), Gerard de Melo (3,4), Ya Zhang (2,5), Yanfeng Wang (2,5), Liang He (1); 1 East China Normal University, 2 Shanghai Artificial Intelligence Laboratory, 3 Hasso Plattner Institute, 4 University of Potsdam, 5 Shanghai Jiao Tong University |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides links to third-party models and resources (e.g., 'https://github.com/baichuan-inc/Baichuan-13B', 'https://github.com/michael-wzhu/ChatMed') but does not state that the code for its own methodology or benchmark is open-source, nor does it provide a link to it. |
| Open Datasets | No | The paper introduces a new benchmark, MedBench, stating it comprises '40,041 questions sourced from authentic examination exercises and medical reports' and 'exclusively sourced from the latest validated exams and expert-annotated EHRs'. It also mentions collecting 'representative exercises from the Chinese Medical Licensing Exam (CNMLE), Resident Standardization Training Exam, and Doctor in-charge Qualification Exam'. However, it does not provide a specific link, DOI, or repository for public access to the MedBench dataset itself. |
| Dataset Splits | No | The paper states, 'We conduct extensive experiments to evaluate the five-shot performance of LLMs, ensuring their capability to respond in a multiple-choice format.' and 'Furthermore, we partition the MedBench dataset based on exams, medical subdiscipline, and question types, and perform independent testing on each subset to enable a comprehensive analysis.' While it describes partitioning for analysis, it does not specify explicit training/validation/test dataset splits for reproducing its own experiments. |
| Hardware Specification | No | The paper states, 'The computations were performed in the ECNU Multifunctional Platform for Innovation (001).', but does not provide specific hardware details such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions evaluating various LLMs (e.g., 'ChatGPT', 'ChatGLM', 'Baichuan-13B', 'HuaTuo', and 'ChatMed') and using their APIs or local deployments. However, it does not provide specific version numbers for underlying software dependencies or libraries. |
| Experiment Setup | Yes | We conduct extensive experiments to evaluate the five-shot performance of LLMs, ensuring their capability to respond in a multiple-choice format. We leverage the API for ChatGPT and opt for local deployment to facilitate evaluations for other LLMs. (A minimal prompting sketch follows the table.) |
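
To make the Experiment Setup row concrete, the following is a minimal sketch, assuming the `openai` Python client and purely illustrative placeholder exemplars, of how a five-shot multiple-choice query of the kind described might be issued against an API-served model. The authors' actual evaluation harness is not released, so the model name, helper functions, and exemplar questions below are assumptions, not their code.

```python
# Minimal sketch (not the authors' released harness): five-shot multiple-choice
# evaluation over an API-served LLM. Exemplars, option labels, and the model
# name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT_EXEMPLARS = [
    # (question, options, answer) triples; a real five-shot prompt would use five.
    ("Which vitamin deficiency causes scurvy?",
     {"A": "Vitamin A", "B": "Vitamin B1", "C": "Vitamin C", "D": "Vitamin D"},
     "C"),
    # ... four more exemplars would follow here
]

def build_prompt(question: str, options: dict) -> str:
    """Concatenate the solved exemplars followed by the target question."""
    parts = []
    for q, opts, ans in FEW_SHOT_EXEMPLARS:
        opt_text = "\n".join(f"{k}. {v}" for k, v in opts.items())
        parts.append(f"Question: {q}\n{opt_text}\nAnswer: {ans}")
    opt_text = "\n".join(f"{k}. {v}" for k, v in options.items())
    parts.append(f"Question: {question}\n{opt_text}\nAnswer:")
    return "\n\n".join(parts)

def answer_multiple_choice(question: str, options: dict,
                           model: str = "gpt-3.5-turbo") -> str:
    """Query the model and keep only the first option letter it emits."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(question, options)}],
        temperature=0,
    )
    text = response.choices[0].message.content.strip()
    for char in text:
        if char in options:
            return char
    return ""  # an unparseable response would simply score as incorrect
```

Greedy decoding (temperature 0) and extraction of the first option letter reflect one plausible way to score deterministic multiple-choice answers; the paper does not specify its answer-parsing rules, so this detail is likewise an assumption.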