Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
CareBot: A Pioneering Full-Process Open-Source Medical Language Model
Authors: Lulu Zhao, Weihao Zeng, Xiaofeng Shi, Hua Zhou
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our rigorous evaluations on Chinese and English benchmarks confirm CareBot's effectiveness in medical consultation and education. These advancements not only address current limitations in medical LLMs but also set a new standard for developing effective and reliable open-source models in the medical domain. |
| Researcher Affiliation | Academia | 1Beijing Academy of Artificial Intelligence (BAAI) 2School of Artificial Intelligence, Beijing University of Posts and Telecommunications |
| Pseudocode | No | The paper describes the methodology in narrative text and flowcharts, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/FlagOpen/CareBot |
| Open Datasets | Yes | Our SFT dataset comprises a diverse array of question types, including multiple-choice questions from medical exams, single-turn disease diagnoses, and multi-turn health consultations. It integrates data from seven publicly available sources: Chinese Medical Dialogue Data, Huatuo26M (Li et al. 2023a), MedDialog (Zeng et al. 2020), ChatMed Consult Dataset (Tian et al. 2023), ChatDoctor (Li et al. 2023b), CMB, and MedQA (Jin et al. 2021). |
| Dataset Splits | No | The paper describes data collection and filtering for constructing training datasets, and mentions using existing benchmark test sets for evaluation (e.g., Huatuo26M-test, CMtMedQA, CMB-Clin). However, it does not explicitly provide details about the training, validation, and test splits used for its own model training and fine-tuning process. |
| Hardware Specification | No | The paper describes the training process and evaluation of the model but does not provide specific details regarding the hardware used (e.g., GPU models, CPU types, memory). |
| Software Dependencies | No | The paper mentions using certain models and tools like LLaMA3-8B, GPT-4, and bge-m3, but does not specify software dependencies such as programming language versions, libraries, or frameworks with their specific version numbers required to replicate the experimental setup. |
| Experiment Setup | No | The paper describes data ratios and token distributions for the CPT stages (e.g., 'combine a high-quality medical pre-training corpus with general data via the ratio as 19:1, with a token-level distribution of 1:9 for Chinese:English'). It also mentions using Direct Preference Optimization (DPO). However, it lacks specific experimental setup details such as learning rates, batch sizes, number of epochs, optimizers, or other hyperparameter values that are crucial for full reproducibility. |
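The mixing ratios quoted above imply concrete token budgets once a corpus size is fixed. A minimal sketch of that arithmetic follows; the 100B-token total is a hypothetical placeholder, since the paper does not report the corpus size:

```python
# Sketch: derive token budgets from the CPT mixing ratios quoted in the paper
# (medical:general = 19:1, Chinese:English = 1:9). Totals are hypothetical.

def split_by_ratio(total: float, parts: tuple[float, float]) -> tuple[float, float]:
    """Split `total` into two shares proportional to `parts`."""
    s = parts[0] + parts[1]
    return (total * parts[0] / s, total * parts[1] / s)

total_tokens = 100e9  # hypothetical corpus size in tokens

# Medical-to-general mix of 19:1
medical, general = split_by_ratio(total_tokens, (19, 1))

# Chinese-to-English token distribution of 1:9
chinese, english = split_by_ratio(total_tokens, (1, 9))

print(f"medical: {medical / 1e9:.1f}B, general: {general / 1e9:.1f}B")
print(f"chinese: {chinese / 1e9:.1f}B, english: {english / 1e9:.1f}B")
```

Under the placeholder total, a 19:1 mix yields 95B medical vs. 5B general tokens, and a 1:9 split yields 10B Chinese vs. 90B English tokens; the learning rate, batch size, and other hyperparameters the row flags as missing cannot be recovered this way.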