MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning

Authors: Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan Ilgen, Emma Pierson, Pang Wei Koh, Yulia Tsvetkov

NeurIPS 2024

Reproducibility assessment (for each variable: the assessed result, followed by the supporting LLM response):
Research Type: Experimental
LLM Response: We conduct experiments to validate each component of MEDIQ. First, we evaluate the Patient system with factuality and relevance metrics (§3.1). Then, we establish the correlation between information availability and accuracy by studying model performance with varying levels of input information (§3.2.1). Finally, we improve the information-seeking ability of LLMs under MEDIQ (§3.2.2).
Researcher Affiliation: Academia
LLM Response: University of Washington, Carnegie Mellon University, Cornell Tech, Allen Institute for AI. Contact: stelli@cs.washington.edu
Pseudocode: No
LLM Response: The paper describes its components and logic through textual descriptions and flowcharts (e.g., Figure 3), but it does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps.
Open Source Code: Yes
LLM Response: https://github.com/stellalisy/mediQ
Open Datasets: Yes
LLM Response: We convert two medical QA datasets, MEDQA (CC-BY 4.0) (Jin et al., 2021) and CRAFT-MD (CC-BY 4.0) (Johri et al., 2023, 2024), into an interactive setup for our experiments.
Dataset Splits: Yes
LLM Response: MEDQA is a standard benchmark for medical question answering with 10178/1272/1273 train/dev/test samples. (A minimal split-check sketch appears after this list.)
Hardware Specification: Yes
LLM Response: We use CPU only for the GPT-based experiments, one A40 GPU for the smaller Llama models (7B, 8B, & 13B), and two A40 GPUs for the 70B models. (A multi-GPU loading sketch appears after this list.)
Software Dependencies: Yes
LLM Response: For the OpenAI models, we use the gpt-3.5-turbo-0125 version for GPT-3.5 and the gpt-4-turbo-2024-04-09 version for GPT-4.
Experiment Setup: Yes
LLM Response: For both the Patient and Expert systems, we use a temperature of 0.5 and top_p = 1 for top-p sampling. (An API-call sketch with these model versions and sampling settings appears below.)
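
The reported MedQA split sizes (10178/1272/1273 train/dev/test) can be sanity-checked against a local copy of the converted data. The following is a minimal sketch, assuming the splits are stored as JSONL files under a hypothetical data/medqa directory with train.jsonl, dev.jsonl, and test.jsonl; the actual file layout in the MediQ repository may differ.

```python
from pathlib import Path

# Split sizes reported for MedQA in the paper: train/dev/test.
EXPECTED = {"train": 10178, "dev": 1272, "test": 1273}

def count_records(path: Path) -> int:
    """Count non-empty JSON-lines records in one split file."""
    with path.open(encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

# Hypothetical directory layout; point this at wherever the converted
# MedQA splits actually live on disk.
data_dir = Path("data/medqa")

for split, expected in EXPECTED.items():
    n = count_records(data_dir / f"{split}.jsonl")
    status = "OK" if n == expected else f"mismatch (expected {expected})"
    print(f"{split}: {n} records -> {status}")
```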
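
The hardware row implies the 70B Llama models are sharded across two A40 GPUs. Below is a minimal sketch of that kind of multi-GPU loading with Hugging Face transformers; the checkpoint name and fp16 dtype are assumptions, since the exact model variants and loading code are not specified on this page.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; the exact 70B variant used in the paper may differ.
MODEL_ID = "meta-llama/Llama-2-70b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# device_map="auto" (via accelerate) shards the weights across the
# available GPUs, e.g. the two A40s mentioned above; fp16 halves memory.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.float16,
)

prompt = "A 45-year-old patient presents with chest pain. What should be asked next?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling settings mirror the experiment setup row (temperature 0.5, top_p 1).
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True,
                         temperature=0.5, top_p=1.0)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```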
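
For the GPT-based systems, the pinned model versions and sampling parameters above translate directly into an OpenAI API call. This is a minimal sketch using the OpenAI Python SDK; the ask_expert helper, system prompt, and example question are placeholders for illustration, not the paper's actual Expert or Patient prompts.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Model versions pinned in the paper.
GPT_35 = "gpt-3.5-turbo-0125"
GPT_4 = "gpt-4-turbo-2024-04-09"

def ask_expert(question: str, model: str = GPT_4) -> str:
    """Single-turn query using the sampling settings reported above."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            # Placeholder system prompt; the paper's Expert/Patient
            # prompts are not reproduced here.
            {"role": "system", "content": "You are a clinical expert system."},
            {"role": "user", "content": question},
        ],
        temperature=0.5,  # as reported for both Patient and Expert systems
        top_p=1,          # top-p sampling with p = 1
    )
    return response.choices[0].message.content

print(ask_expert("What additional history would you ask for in a patient with chest pain?"))
```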