Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis

Authors: Haolin Li, Tianjie Dai, Zhe Chen, Siyuan Du, Jiangchao Yao, Ya Zhang, Yanfeng Wang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive evaluations across four datasets with different anatomies demonstrate RAD's generalizability, achieving state-of-the-art performance. Furthermore, RAD enables the model to concentrate more precisely on abnormal regions and critical indicators, ensuring evidence-based, trustworthy diagnosis. Our code is available at this repository.
Researcher Affiliation	Academia	Haolin Li (1,2), Tianjie Dai (3), Zhe Chen (2,3), Siyuan Du (1,2), Jiangchao Yao (2,3), Ya Zhang (2,4,5), Yanfeng Wang (2,4). (1) College of Computer Science and Artificial Intelligence, Fudan University; (2) Shanghai AI Laboratory; (3) CMIC, Shanghai Jiao Tong University; (4) School of Artificial Intelligence, Shanghai Jiao Tong University; (5) Institute of Artificial Intelligence for Medicine, Shanghai Jiao Tong University School of Medicine
Pseudocode	Yes	Algorithm 1 Guideline Recall 1: Input: Guideline G, text token sequence T, attention weights A, threshold θ 2: U ← Extract indicators from G 3: attended = 0, total = 0 4: for each u ∈ U do 5: Matched ← Tokens in T matching u 6: if Matched ≠ ∅ then 7: total = total + 1 8: if mean(A_Matched) > θ then 9: attended = attended + 1 10: return attended/total if total > 0 else 0
Open Source Code	Yes	Our code is available at this repository. The code is available at: https://github.com/tdlhl/RAD.
Open Datasets	Yes	We aligned MIMIC-CXR [28] and MIMIC-IV [29] to construct the MIMIC-ICD53 dataset, covering three modalities with 53 types of disease. We will release the dataset on PhysioNet [42]. Harvard-FairVLMed dataset [41], sourced from the Department of Ophthalmology at Harvard Medical School, contains 10,000 multimodal samples... The dataset is publicly available under the CC BY-NC-ND 4.0 license at GitHub (footnote 4). SkinCAP is a multimodal dermatology dataset... It is publicly available under an open license at Hugging Face (footnote 5).
Dataset Splits	Yes	The training set and test set are randomly divided in a ratio of 4:1. The final processed dataset is randomly divided in a ratio of 4:1 for training and testing. Harvard-FairVLMed dataset [41], sourced from the Department of Ophthalmology at Harvard Medical School, contains 10,000 multimodal samples (7,000 train, 1,000 val, 2,000 test).
Hardware Specification	Yes	All experiments are conducted on a single NVIDIA A100 GPU.
Software Dependencies	No	The default backbone of the text/image encoder is ClinicalBERT [55] and ResNet-50 [22], respectively. In practice, we choose Qwen2.5-72B [64] as the LLM.
Experiment Setup	Yes	In practice, Top-k in Eq.(1) is set to 10. All guidelines obtained by Eq.(2) and the indicators used in Algorithm 1 are manually verified to avoid potential factual errors. The default backbone of the text/image encoder is ClinicalBERT [55] and ResNet-50 [22], respectively. The hyperparameters α and β, which serve as the balancing ratio between different losses, are set to be 1e-2 and 1e-1, respectively. All experiments are conducted on a single NVIDIA A100 GPU.
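
The Guideline Recall procedure quoted in the Pseudocode row (Algorithm 1) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name guideline_recall, the choice to pass already-extracted indicators instead of the raw guideline G, and the exact, case-insensitive token matching rule are all assumptions, since the table does not specify how indicators are extracted or matched.

```python
from statistics import mean
from typing import Sequence


def guideline_recall(
    indicators: Sequence[str],
    tokens: Sequence[str],
    attention: Sequence[float],
    threshold: float,
) -> float:
    """Share of indicators found in the text whose mean attention exceeds the threshold.

    indicators: indicators already extracted from the guideline (extraction step assumed done).
    tokens:     text token sequence T.
    attention:  one attention weight per token (A), aligned with `tokens`.
    threshold:  the cutoff theta from Algorithm 1.
    """
    if len(tokens) != len(attention):
        raise ValueError("tokens and attention must have the same length")

    attended = 0
    total = 0
    for indicator in indicators:
        # Assumption: a token "matches" an indicator on exact, case-insensitive equality.
        matched = [i for i, tok in enumerate(tokens) if tok.lower() == indicator.lower()]
        if matched:  # only indicators that actually occur in the text are counted
            total += 1
            if mean([attention[i] for i in matched]) > threshold:
                attended += 1
    # Algorithm 1 returns 0 when no indicator occurs in the text.
    return attended / total if total > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical toy input, for illustration only.
    toks = ["patient", "has", "fever", "and", "cough"]
    attn = [0.1, 0.05, 0.6, 0.05, 0.2]
    print(guideline_recall(["fever", "cough", "edema"], toks, attn, threshold=0.3))
```

In the toy call, "fever" and "cough" occur in the text, so both count toward the total, while "edema" is skipped. Only "fever" exceeds the threshold, so the score is the fraction of in-text indicators that received enough attention.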