Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
CareBot: A Pioneering Full-Process Open-Source Medical Language Model
Authors: Lulu Zhao, Weihao Zeng, Xiaofeng Shi, Hua Zhou
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our rigorous evaluations on Chinese and English benchmarks confirm CareBot's effectiveness in medical consultation and education. These advancements not only address current limitations in medical LLMs but also set a new standard for developing effective and reliable open-source models in the medical domain. |
| Researcher Affiliation | Academia | 1Beijing Academy of Artificial Intelligence (BAAI) 2School of Artificial Intelligence, Beijing University of Posts and Telecommunications |
| Pseudocode | No | The paper describes the methodology in narrative text and flowcharts, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/FlagOpen/CareBot |
| Open Datasets | Yes | Our SFT dataset comprises a diverse array of question types, including multiple-choice questions from medical exams, single-turn disease diagnoses, and multi-turn health consultations. It integrates data from seven publicly available sources: Chinese Medical Dialogue Data, Huatuo26M (Li et al. 2023a), MedDialog (Zeng et al. 2020), ChatMed Consult Dataset (Tian et al. 2023), ChatDoctor (Li et al. 2023b), CMB, and MedQA (Jin et al. 2021). |
| Dataset Splits | No | The paper describes data collection and filtering for constructing training datasets, and mentions using existing benchmark test sets for evaluation (e.g., Huatuo26M-test, CMtMedQA, CMB-Clin). However, it does not explicitly provide details about the training, validation, and test splits used for its own model training and fine-tuning process. |
| Hardware Specification | No | The paper describes the training process and evaluation of the model but does not provide specific details regarding the hardware used (e.g., GPU models, CPU types, memory). |
| Software Dependencies | No | The paper mentions using certain models and tools like LLaMA3-8B, GPT-4, and bge-m3, but does not specify software dependencies such as programming language versions, libraries, or frameworks with their specific version numbers required to replicate the experimental setup. |
| Experiment Setup | No | The paper describes data ratios and token distributions for the CPT stages (e.g., 'combine a high-quality medical pre-training corpus with general data via the ratio as 19:1, with a token-level distribution of 1:9 for Chinese:English'). It also mentions using Direct Preference Optimization (DPO). However, it lacks specific experimental setup details such as learning rates, batch sizes, number of epochs, optimizers, or other hyperparameter values that are crucial for full reproducibility. |
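The mixing ratios quoted above imply concrete token budgets once a corpus size is fixed. A minimal sketch of that arithmetic follows; the 100B-token total is a hypothetical placeholder, since the paper does not report the corpus size:

```python
# Sketch: derive token budgets from the CPT mixing ratios quoted in the paper
# (medical:general = 19:1, Chinese:English = 1:9). Totals are hypothetical.

def split_by_ratio(total: float, parts: tuple[float, float]) -> tuple[float, float]:
    """Split `total` into two shares proportional to `parts`."""
    s = parts[0] + parts[1]
    return (total * parts[0] / s, total * parts[1] / s)

total_tokens = 100e9  # hypothetical corpus size in tokens

# Medical-to-general mix of 19:1
medical, general = split_by_ratio(total_tokens, (19, 1))

# Chinese-to-English token distribution of 1:9
chinese, english = split_by_ratio(total_tokens, (1, 9))

print(f"medical: {medical / 1e9:.1f}B, general: {general / 1e9:.1f}B")
print(f"chinese: {chinese / 1e9:.1f}B, english: {english / 1e9:.1f}B")
```

Under the placeholder total, a 19:1 mix yields 95B medical vs. 5B general tokens, and a 1:9 split yields 10B Chinese vs. 90B English tokens; the learning rate, batch size, and other hyperparameters the row flags as missing cannot be recovered this way.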