Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
Authors: Peng Xia, Kangyu Zhu, Haoran Li, Tianze Wang, Weijia Shi, Sheng Wang, Linjun Zhang, James Y. Zou, Huaxiu Yao
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across five medical datasets (involving radiology, ophthalmology, pathology) on medical VQA and report generation demonstrate that MMed-RAG can achieve an average improvement of 43.8% in the factual accuracy of Med-LVLMs. |
| Researcher Affiliation | Academia | 1UNC-Chapel Hill, 2Brown University, 3Carnegie Mellon University, 4Rutgers University, 5University of Washington, 6Stanford University |
| Pseudocode | Yes | Algorithm 1: Versatile Multimodal RAG System (MMed-RAG) |
| Open Source Code | Yes | Our data and code are available in https://github.com/richard-peng-xia/MMed-RAG. |
| Open Datasets | Yes | We utilize five medical vision-language datasets for medical VQA and report generation tasks, i.e., MIMIC-CXR (Johnson et al., 2019), IU-Xray (Demner-Fushman et al., 2016), Harvard-Fair VLMed (Luo et al., 2024), PMC-OA (Lin et al., 2023a) (we only select the pathology part) and Quilt-1M (Ikezogwo et al., 2024). |
| Dataset Splits | No | The paper provides data statistics for training the retriever and RAG-PT (Tables 6, 7), and total numbers of images and QA items for evaluation datasets (Table 8). However, it does not explicitly provide the train/test/validation splits for the five medical datasets used for medical VQA and report generation tasks (MIMIC-CXR, IU-Xray, Harvard-Fair VLMed, PMC-OA, Quilt-1M) to reproduce the experiment's evaluation phase. |
| Hardware Specification | Yes | Training for 20 hours on one A100 80G GPU. For the first phase, we trained for 3 epochs, and for the second phase, the training was conducted for 1 epoch. All the experiments are implemented on PyTorch 2.1.2 using four NVIDIA RTX A6000 GPUs. |
| Software Dependencies | Yes | All the experiments are implemented on PyTorch 2.1.2 using four NVIDIA RTX A6000 GPUs. |
| Experiment Setup | Yes | We use the AdamW optimizer with a learning rate of 10⁻³, weight decay of 10⁻² and a batch size of 32. The model is trained for 360 epochs. For the first phase, we trained for 3 epochs, and for the second phase, the training was conducted for 1 epoch. For the RAG-PT phase, we adjust the diffusion noise level, symbolized by ξ, through a specific formula: ξ = Sigmoid(lₜ) · (0.5 × 10⁻² − 10⁻⁵) + 10⁻⁵, where ε is drawn from a normal distribution. In our experiments, we apply cross-validation to tune all hyperparameters with grid search. |
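The quoted noise-level formula from the Experiment Setup row can be sketched directly. This is a minimal illustration, not the authors' implementation: the input `l_t` is whatever per-step quantity the paper feeds into the sigmoid, and the function name is hypothetical.

```python
import math

def noise_level(l_t: float) -> float:
    """Diffusion noise level from the quoted RAG-PT formula:
    xi = Sigmoid(l_t) * (0.5 * 10^-2 - 10^-5) + 10^-5
    so xi is bounded in (10^-5, 0.5 * 10^-2)."""
    sigmoid = 1.0 / (1.0 + math.exp(-l_t))
    return sigmoid * (0.5e-2 - 1e-5) + 1e-5
```

Because the sigmoid maps any real input into (0, 1), the noise level is smoothly squashed between the floor 10⁻⁵ and the ceiling 0.5 × 10⁻².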