Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training

Authors: David Dai, Peilin Chen, Chanakya Ekbote, Paul Liang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We design experiments to answer the following research questions. Details are included in App. D. RQ1: How does DRPO compare with other critic-free RL methods and models? As detailed in Sec. 3.2, we train and evaluate QoQ-Med on a combination of 30 clinical diagnosis datasets across 9 clinical domains. ... RQ2: How well does DRPO handle mixed multimodal inputs? We repeat the comparison on MIMIC-IV... RQ3: How is the quality of the reasoning traces and bounding boxes learned by DRPO? We did both a qualitative and a quantitative analysis on QoQ-Med's reasoning and bounding box outputs.
Researcher Affiliation	Academia	Wei Dai, Peilin Chen, Chanakya Ekbote, Paul Pu Liang MIT Media Lab and MIT EECS EMAIL
Pseudocode	No	The paper describes the DRPO algorithm in detail in Section 3.3 and its sub-sections (Domain-aware Relative Policy Optimization (DRPO), Hierarchical Cluster-Based Scaling, DRPO Objective), including mathematical formulations, but it does not present a formal 'Pseudocode' or 'Algorithm' block.
Open Source Code	Yes	To foster reproducibility and downstream research, we release (i) the full model weights, (ii) the modular training pipeline, and (iii) all intermediate reasoning traces at this link. Finally, we publicly release our model, training pipeline, and reasoning traces generated by the model across 2.61 million question-answer pairs at this link. We release our repository containing the code used for all experiments. We also include all the datasets we used. We open source our training pipeline, model weights and training hyperparameters. The dataset used in our model is fully public, with little to no license restrictions.
Open Datasets	Yes	Training Data. We train the unified vision and time-series model across 33 datasets using the CLIMB dataset [22]. The dataset contains 2.61 million samples across 1D (ECG), 2D (Chest X-ray, Mammography, Dermoscopy, histopathology, Fundus), and 3D (Ultrasound, MRI, CT Scan) data. The exact composition of the data and the training hyperparameters are included in App. C and D. C. Details of the Datasets used in Training and Validation We use the CLIMB dataset to train our QoQ-Med model. CLIMB is a multimodal clinical diagnosis dataset introduced in Dai et al. [22]. It contains a mixture of 44 publicly available datasets across 13 domains. In this work, we use the vision (2D and 3D) and ECG subset of the CLIMB dataset, which contains 707K 2D, 1.83M 3D, and 78.9K ECG data. A list of datasets used in the paper is included in Table 6.
Dataset Splits	Yes	For RQ1, we use the same training/validation split as in the original CLIMB dataset, which largely inherits the splits from the original papers. On the MIMIC-IV dataset, the model has to reason across ECGs, chest X-rays, and health records. ... LOS prediction is formulated as a 4-class classification problem. Patient stays are binned into the following categories: Class A: 0-4 days; Class B: 5-8 days; Class C: 9-12 days; Class D: more than 12 days. 48-IHM is formulated as a binary classification task. A positive label is assigned if either: the patient's death date is within 48 hours of admission, or the patient is discharged to hospice care within 48 hours of admission.
Hardware Specification	Yes	The models are trained for 1 epoch on an 8x NVIDIA A100 and H200 GPU instances. ... The training of 32B model takes more than 2 weeks to train on an 8x A100 machine, so a 8x H200 machine is used to speed up the training process of 32B model via faster interconnect.
Software Dependencies	No	We build our training pipeline based on the FSDP and VeRL framework, with vLLM to speed up reasoning training with KV Cache. We use a learning rate of 1e-6, a weight decay of 1e-2, and a KL coefficient of 1e-4. We use AdamW full model training at 32-bit precision for all 7B models, and at 16-bit precision for the training of the 32B model. ... For all RL training methods, we use Qwen2.5-VL-7B [9] as the base model. For QoQ-Med, an ECG encoder named ECG-JEPA [43] is prepended. The paper mentions several software frameworks (FSDP, VeRL, vLLM) and models (Qwen2.5-VL-7B, ECG-JEPA) used in the training pipeline but does not provide specific version numbers for these, nor for broader dependencies like Python, PyTorch, or CUDA.
Experiment Setup	Yes	Unless mentioned otherwise, we use the same set of hyperparameters to train the model across different training methods. The models are trained for 1 epoch on an 8x NVIDIA A100 and H200 GPU instances. For 7B models, we use a per-device batch size of 4, and a rollout batch size of 512. The maximum context length is 8192. To ensure consistency throughout the training, we shuffle the data with seed 42 beforehand, and disable shuffling throughout the training process. To save compute, we employ early stopping, which stops training when the accuracy converges and stops improving. Most 7B model trainings converge within 2 days of training. ... We use a learning rate of 1e-6, a weight decay of 1e-2, and a KL coefficient of 1e-4. We use AdamW full model training at 32-bit precision for all 7B models, and at 16-bit precision for the training of the 32B model.
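
The label definitions quoted in the Dataset Splits row (four length-of-stay bins and the 48-hour in-hospital mortality rule) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names `los_class` and `ihm_48_label` are hypothetical, stays are assumed to be counted in whole days, and the 48-hour window is assumed to include its endpoint; the quoted text does not specify either detail.

```python
from datetime import datetime, timedelta
from typing import Optional


def los_class(stay_days: int) -> str:
    """Map a length of stay in whole days to the four quoted bins.

    Assumption: stays are whole days; the quoted text does not say how
    fractional stays (e.g. 4.5 days) are handled.
    """
    if stay_days < 0:
        raise ValueError("stay_days must be non-negative")
    if stay_days <= 4:
        return "A"  # 0-4 days
    if stay_days <= 8:
        return "B"  # 5-8 days
    if stay_days <= 12:
        return "C"  # 9-12 days
    return "D"      # more than 12 days


def ihm_48_label(
    admit_time: datetime,
    death_time: Optional[datetime] = None,
    hospice_discharge_time: Optional[datetime] = None,
) -> int:
    """Return 1 if death or hospice discharge falls within 48 hours of admission.

    Assumption: the window is inclusive of the 48-hour mark.
    """
    window_end = admit_time + timedelta(hours=48)
    for event in (death_time, hospice_discharge_time):
        if event is not None and admit_time <= event <= window_end:
            return 1
    return 0
```

For example, `los_class(9)` falls in Class C, and a stay with a hospice discharge 10 hours after admission receives a positive 48-IHM label.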