MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making
Authors: Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik Siu Chan, Xuhai "Orson" Xu, Daniel McDuff, Hyeonhoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, Hae Won Park
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our framework and baseline methods using state-of-the-art LLMs across a suite of real-world medical knowledge and medical diagnosis benchmarks, including a comparison of LLMs' medical complexity classification against human physicians. MDAgents achieved the best performance in seven out of ten benchmarks on tasks requiring an understanding of medical knowledge and multi-modal reasoning, showing a significant improvement of up to 4.2% (p < 0.05) over the best performance of previous methods. Ablation studies reveal that MDAgents effectively determines medical complexity to optimize for efficiency and accuracy across diverse medical tasks. |
| Researcher Affiliation | Collaboration | Massachusetts Institute of Technology; Google Research; Seoul National University Hospital |
| Pseudocode | Yes | Algorithm 1: Adaptive Medical Decision-making Framework (see the routing sketch after this table). |
| Open Source Code | Yes | Our code can be found at https://github.com/mitmedialab/MDAgents. |
| Open Datasets | Yes | To verify the effectiveness of our framework, we conduct comprehensive experiments with baseline methods on ten datasets including MedQA [35], PubMedQA [36], DDXPlus [73], SymCat [2], JAMA [9], MedBullets [9], Path-VQA [30], PMC-VQA [95], MIMIC-CXR [3] and MedVidQA [29]. A detailed explanation and statistics for each dataset are deferred to Appendix A and Figure 8. |
| Dataset Splits | No | The paper states using '50 samples per dataset for testing' but does not explicitly provide training or validation dataset splits needed for reproducibility. It uses pre-trained LLMs with few-shot/zero-shot prompting. |
| Hardware Specification | No | The experiments primarily involved inference via API calls to GPT-3.5, GPT-4(V), and Gemini-Pro (Vision). The type of compute workers, memory, and execution time are managed by the API providers (OpenAI and Google). |
| Software Dependencies | No | The paper mentions using LLMs like GPT-4(V), Gemini-Pro(Vision), GPT-4o mini, and GPT-3.5 via API calls, but does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or other libraries). |
| Experiment Setup | Yes | We use 3-shot prompting for low-complexity cases and zero-shot prompting for moderate- and high-complexity cases across all settings (see the prompting sketch after this table). |
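
For readers assessing reproducibility, the adaptive routing described in Algorithm 1 can be summarized as: classify a query's medical complexity, then answer with a single agent for low-complexity cases or convene a collaborative group otherwise. The sketch below is a minimal illustration under that reading; `query_llm`, the specialist roles, and the synthesis step are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of the adaptive routing idea behind Algorithm 1.
# `query_llm` is an assumed callable wrapping any chat-completion API;
# the roles and synthesis prompt are illustrative, not the paper's prompts.
from typing import Callable, List

COMPLEXITY_LABELS = ("low", "moderate", "high")

def classify_complexity(question: str, query_llm: Callable[[str], str]) -> str:
    """Ask a moderator LLM to rate the medical complexity of a query."""
    prompt = (
        "Classify the medical complexity of the following question as "
        "'low', 'moderate', or 'high'. Answer with one word.\n\n" + question
    )
    label = query_llm(prompt).strip().lower()
    return label if label in COMPLEXITY_LABELS else "moderate"  # conservative fallback

def solve_adaptively(question: str, query_llm: Callable[[str], str]) -> str:
    """Route the query to a solo agent or a collaborative setting by complexity."""
    complexity = classify_complexity(question, query_llm)
    if complexity == "low":
        # Low complexity: a single agent answers directly.
        return query_llm(question)
    # Moderate/high complexity: gather several role-conditioned answers and
    # synthesize them; this simple aggregation stands in for the paper's
    # collaborative discussion among agents.
    specialist_answers: List[str] = [
        query_llm(f"You are a {role}. Answer concisely.\n\n{question}")
        for role in ("cardiologist", "radiologist", "general practitioner")
    ]
    synthesis_prompt = (
        "Given these specialist answers, produce a single final answer:\n"
        + "\n".join(f"- {a}" for a in specialist_answers)
    )
    return query_llm(synthesis_prompt)
```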
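
The reported experiment setup (3-shot prompting for low-complexity cases, zero-shot for moderate and high) can likewise be sketched as a prompt builder. The exemplar placeholders and wording below are assumptions for illustration, not the paper's actual prompts.

```python
# Illustrative sketch of the reported prompting setup: prepend three worked
# exemplars only for low-complexity cases, otherwise use the question alone.
FEW_SHOT_EXEMPLARS = [
    ("Example question 1 ...", "Example answer 1 ..."),
    ("Example question 2 ...", "Example answer 2 ..."),
    ("Example question 3 ...", "Example answer 3 ..."),
]

def build_prompt(question: str, complexity: str) -> str:
    """3-shot prompt for low complexity, zero-shot otherwise."""
    if complexity == "low":
        shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT_EXEMPLARS)
        return f"{shots}\n\nQ: {question}\nA:"
    # Moderate/high complexity: zero-shot, question only.
    return f"Q: {question}\nA:"
```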