MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning
Authors: Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan Ilgen, Emma Pierson, Pang Wei Koh, Yulia Tsvetkov
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments to validate each component of MEDIQ. First, we evaluate the Patient system with factuality and relevance metrics (§3.1). Then, we establish the correlation between information availability and accuracy by studying model performance with varying levels of input information (§3.2.1). Finally, we improve the information-seeking ability of LLMs under MEDIQ (§3.2.2). |
| Researcher Affiliation | Academia | University of Washington, Carnegie Mellon University, Cornell Tech, Allen Institute for AI (contact: stelli@cs.washington.edu) |
| Pseudocode | No | The paper describes its components and logic through textual descriptions and flowcharts (e.g., Figure 3), but it does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps. |
| Open Source Code | Yes | https://github.com/stellalisy/mediQ |
| Open Datasets | Yes | We convert two medical QA datasets, MEDQA (CC-BY 4.0) (Jin et al., 2021) and CRAFT-MD (CC-BY 4.0) (Johri et al., 2023, 2024), into an interactive setup for our experiments. |
| Dataset Splits | Yes | MEDQA is a standard benchmark for medical question answering with 10178/1272/1273 train/dev/test samples. |
| Hardware Specification | Yes | We use CPU only for the GPT-based experiments, one A40 GPU for the smaller Llama models (7B, 8B, & 13B), and two A40 GPUs for the 70B models. |
| Software Dependencies | Yes | For the OpenAI models, we use the gpt-3.5-turbo-0125 version for GPT-3.5 and the gpt-4-turbo-2024-04-09 version for GPT-4. |
| Experiment Setup | Yes | For both the Patient and Expert systems, we use a temperature of 0.5 and top_p = 1 for top-p sampling (see the illustrative sketch after this table). |
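
The following is a minimal sketch of how the reported inference settings could be reproduced with the OpenAI Python SDK. The model identifiers, temperature, and top_p values come from the table above; the client code, function name, and prompts are illustrative assumptions rather than the authors' implementation (their code is available in the linked repository).

```python
# Minimal sketch (not the authors' code): querying the pinned GPT model versions
# with the sampling settings reported in the paper (temperature 0.5, top_p 1).
# The function name and prompt strings are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = {
    "gpt-3.5": "gpt-3.5-turbo-0125",       # GPT-3.5 version reported in the paper
    "gpt-4": "gpt-4-turbo-2024-04-09",     # GPT-4 version reported in the paper
}

def query(model_key: str, system_prompt: str, user_prompt: str) -> str:
    """Send a single turn to the chosen model using the reported sampling parameters."""
    response = client.chat.completions.create(
        model=MODELS[model_key],
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.5,  # reported setting
        top_p=1,          # reported setting
    )
    return response.choices[0].message.content

# Example usage with hypothetical prompts:
# reply = query("gpt-4", "You are an expert clinician.",
#               "Given the partial patient record, what follow-up question would you ask?")
```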