MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making
Authors: Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik Siu Chan, Xuhai "Orson" Xu, Daniel McDuff, Hyeonhoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, Hae Won Park
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our framework and baseline methods using state-of-the-art LLMs across a suite of real-world medical knowledge and medical diagnosis benchmarks, including a comparison of LLMs' medical complexity classification against human physicians. MDAgents achieved the best performance in seven out of ten benchmarks on tasks requiring an understanding of medical knowledge and multi-modal reasoning, showing a significant improvement of up to 4.2% (p < 0.05) over the best performance of previous methods. Ablation studies reveal that MDAgents effectively determines medical complexity to optimize for efficiency and accuracy across diverse medical tasks. |
| Researcher Affiliation | Collaboration | Massachusetts Institute of Technology; Google Research; Seoul National University Hospital |
| Pseudocode | Yes | Algorithm 1: Adaptive Medical Decision-making Framework (see the routing sketch after this table). |
| Open Source Code | Yes | Our code can be found at https://github.com/mitmedialab/MDAgents. |
| Open Datasets | Yes | To verify the effectiveness of our framework, we conduct comprehensive experiments with baseline methods on ten datasets including MedQA [35], PubMedQA [36], DDXPlus [73], SymCat [2], JAMA [9], MedBullets [9], Path-VQA [30], PMC-VQA [95], MIMIC-CXR [3] and MedVidQA [29]. A detailed explanation and statistics for each dataset are deferred to Appendix A and Figure 8. |
| Dataset Splits | No | The paper states using '50 samples per dataset for testing' but does not explicitly provide training or validation dataset splits needed for reproducibility. It uses pre-trained LLMs with few-shot/zero-shot prompting. |
| Hardware Specification | No | The experiments primarily involved inference via API calls to GPT-3.5, GPT-4(V), and Gemini-Pro (Vision). The type of compute workers, memory, and execution time are managed by the API providers (OpenAI and Google). |
| Software Dependencies | No | The paper mentions using LLMs like GPT-4(V), Gemini-Pro(Vision), GPT-4o mini, and GPT-3.5 via API calls, but does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or other libraries). |
| Experiment Setup | Yes | We use 3-shot prompting for low-complexity cases and zero-shot prompting for moderate- and high-complexity cases across all settings (see the prompting sketch after this table). |
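
For readers assessing reproducibility, the adaptive routing described in Algorithm 1 can be summarized as: classify a query's medical complexity, then answer with a single agent for low-complexity cases or convene a collaborative group otherwise. The sketch below is a minimal illustration under that reading; `query_llm`, the specialist roles, and the synthesis step are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of the adaptive routing idea behind Algorithm 1.
# `query_llm` is an assumed callable wrapping any chat-completion API;
# the roles and synthesis prompt are illustrative, not the paper's prompts.
from typing import Callable, List

COMPLEXITY_LABELS = ("low", "moderate", "high")

def classify_complexity(question: str, query_llm: Callable[[str], str]) -> str:
    """Ask a moderator LLM to rate the medical complexity of a query."""
    prompt = (
        "Classify the medical complexity of the following question as "
        "'low', 'moderate', or 'high'. Answer with one word.\n\n" + question
    )
    label = query_llm(prompt).strip().lower()
    return label if label in COMPLEXITY_LABELS else "moderate"  # conservative fallback

def solve_adaptively(question: str, query_llm: Callable[[str], str]) -> str:
    """Route the query to a solo agent or a collaborative setting by complexity."""
    complexity = classify_complexity(question, query_llm)
    if complexity == "low":
        # Low complexity: a single agent answers directly.
        return query_llm(question)
    # Moderate/high complexity: gather several role-conditioned answers and
    # synthesize them; this simple aggregation stands in for the paper's
    # collaborative discussion among agents.
    specialist_answers: List[str] = [
        query_llm(f"You are a {role}. Answer concisely.\n\n{question}")
        for role in ("cardiologist", "radiologist", "general practitioner")
    ]
    synthesis_prompt = (
        "Given these specialist answers, produce a single final answer:\n"
        + "\n".join(f"- {a}" for a in specialist_answers)
    )
    return query_llm(synthesis_prompt)
```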
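
The reported experiment setup (3-shot prompting for low-complexity cases, zero-shot for moderate and high) can likewise be sketched as a prompt builder. The exemplar placeholders and wording below are assumptions for illustration, not the paper's actual prompts.

```python
# Illustrative sketch of the reported prompting setup: prepend three worked
# exemplars only for low-complexity cases, otherwise use the question alone.
FEW_SHOT_EXEMPLARS = [
    ("Example question 1 ...", "Example answer 1 ..."),
    ("Example question 2 ...", "Example answer 2 ..."),
    ("Example question 3 ...", "Example answer 3 ..."),
]

def build_prompt(question: str, complexity: str) -> str:
    """3-shot prompt for low complexity, zero-shot otherwise."""
    if complexity == "low":
        shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT_EXEMPLARS)
        return f"{shots}\n\nQ: {question}\nA:"
    # Moderate/high complexity: zero-shot, question only.
    return f"Q: {question}\nA:"
```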