MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning

Authors: Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan Ilgen, Emma Pierson, Pang Wei Koh, Yulia Tsvetkov

NeurIPS 2024

Reproducibility assessment (for each variable: the assessed result, followed by the supporting LLM response):
Research Type: Experimental
LLM Response: We conduct experiments to validate each component of MEDIQ. First, we evaluate the Patient system with factuality and relevance metrics (§3.1). Then, we establish the correlation between information availability and accuracy by studying model performance with varying levels of input information (§3.2.1). Finally, we improve the information-seeking ability of LLMs under MEDIQ (§3.2.2).
Researcher Affiliation: Academia
LLM Response: University of Washington, Carnegie Mellon University, Cornell Tech, Allen Institute for AI. Contact: stelli@cs.washington.edu
Pseudocode: No
LLM Response: The paper describes its components and logic through textual descriptions and flowcharts (e.g., Figure 3), but it does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps.
Open Source Code: Yes
LLM Response: https://github.com/stellalisy/mediQ
Open Datasets: Yes
LLM Response: We convert two medical QA datasets, MEDQA (CC-BY 4.0) (Jin et al., 2021) and CRAFT-MD (CC-BY 4.0) (Johri et al., 2023, 2024), into an interactive setup for our experiments.
Dataset Splits: Yes
LLM Response: MEDQA is a standard benchmark for medical question answering with 10178/1272/1273 train/dev/test samples. (A minimal split-check sketch appears after this list.)
Hardware Specification: Yes
LLM Response: We use CPU only for the GPT-based experiments, one A40 GPU for the smaller Llama models (7B, 8B, & 13B), and two A40 GPUs for the 70B models. (A multi-GPU loading sketch appears after this list.)
Software Dependencies: Yes
LLM Response: For the OpenAI models, we use the gpt-3.5-turbo-0125 version for GPT-3.5 and the gpt-4-turbo-2024-04-09 version for GPT-4.
Experiment Setup: Yes
LLM Response: For both the Patient and Expert systems, we use a temperature of 0.5 and top_p = 1 for top-p sampling. (An API-call sketch with these model versions and sampling settings appears below.)
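
The reported MedQA split sizes (10178/1272/1273 train/dev/test) can be sanity-checked against a local copy of the converted data. The following is a minimal sketch, assuming the splits are stored as JSONL files under a hypothetical data/medqa directory with train.jsonl, dev.jsonl, and test.jsonl; the actual file layout in the MediQ repository may differ.

```python
from pathlib import Path

# Split sizes reported for MedQA in the paper: train/dev/test.
EXPECTED = {"train": 10178, "dev": 1272, "test": 1273}

def count_records(path: Path) -> int:
    """Count non-empty JSON-lines records in one split file."""
    with path.open(encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

# Hypothetical directory layout; point this at wherever the converted
# MedQA splits actually live on disk.
data_dir = Path("data/medqa")

for split, expected in EXPECTED.items():
    n = count_records(data_dir / f"{split}.jsonl")
    status = "OK" if n == expected else f"mismatch (expected {expected})"
    print(f"{split}: {n} records -> {status}")
```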
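
The hardware row implies the 70B Llama models are sharded across two A40 GPUs. Below is a minimal sketch of that kind of multi-GPU loading with Hugging Face transformers; the checkpoint name and fp16 dtype are assumptions, since the exact model variants and loading code are not specified on this page.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; the exact 70B variant used in the paper may differ.
MODEL_ID = "meta-llama/Llama-2-70b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# device_map="auto" (via accelerate) shards the weights across the
# available GPUs, e.g. the two A40s mentioned above; fp16 halves memory.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.float16,
)

prompt = "A 45-year-old patient presents with chest pain. What should be asked next?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling settings mirror the experiment setup row (temperature 0.5, top_p 1).
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True,
                         temperature=0.5, top_p=1.0)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```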
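
For the GPT-based systems, the pinned model versions and sampling parameters above translate directly into an OpenAI API call. This is a minimal sketch using the OpenAI Python SDK; the ask_expert helper, system prompt, and example question are placeholders for illustration, not the paper's actual Expert or Patient prompts.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Model versions pinned in the paper.
GPT_35 = "gpt-3.5-turbo-0125"
GPT_4 = "gpt-4-turbo-2024-04-09"

def ask_expert(question: str, model: str = GPT_4) -> str:
    """Single-turn query using the sampling settings reported above."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            # Placeholder system prompt; the paper's Expert/Patient
            # prompts are not reproduced here.
            {"role": "system", "content": "You are a clinical expert system."},
            {"role": "user", "content": question},
        ],
        temperature=0.5,  # as reported for both Patient and Expert systems
        top_p=1,          # top-p sampling with p = 1
    )
    return response.choices[0].message.content

print(ask_expert("What additional history would you ask for in a patient with chest pain?"))
```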