ED-Copilot: Reduce Emergency Department Wait Time with Language Model Diagnostic Assistance
Authors: Liwen Sun, Abhineet Agarwal, Aaron Kornblith, Bin Yu, Chenyan Xiong
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on MIMIC-ED-Assist show that ED-Copilot outperforms state-of-the-art tree models while halving time-costs of laboratory testing from four hours to two hours. Our analyses also confirm the benefit of ED-Copilot's personalized modeling approach. In Section 2, we review related work. We discuss MIMIC-ED-Assist and ED-Copilot in Sections 3 and 4 respectively. Sections 5 and 6 discuss our experimental set-up and results. |
| Researcher Affiliation | Academia | ¹Language Technologies Institute, Carnegie Mellon University; ²Department of Statistics, University of California, Berkeley; ³Department of Emergency Medicine & Pediatrics, University of California, San Francisco; ⁴Department of EECS, University of California, Berkeley. Correspondence to: Chenyan Xiong <cx@cs.cmu.edu>. |
| Pseudocode | Yes | Algorithm 1: Proximal Policy Optimization (PPO); a PPO training sketch follows the table. |
| Open Source Code | Yes | Our code is available at https://github.com/cxcscmu/ED-Copilot. |
| Open Datasets | Yes | In collaboration with ED clinicians, we use public patient data to curate MIMIC-ED-Assist, a benchmark for AI systems to suggest laboratory tests that minimize wait time while accurately predicting critical outcomes such as death. MIMIC-ED-Assist is derived from MIMIC-IV (Johnson et al., 2023b) and related datasets (Xie et al., 2022). Our pipeline to create MIMIC-ED-Assist from the MIMIC-IV dataset can be found at https://github.com/cxcscmu/ED-Copilot. After completing a training course and signing a data use agreement regarding patient information privacy, individuals will gain access to MIMIC-IV and can utilize our pipeline to create MIMIC-ED-Assist. |
| Dataset Splits | Yes | We randomly split the dataset using 80% for training, 10% for validation, and 10% for testing, while ensuring each split has the same class distribution. The validation set is used to tune hyper-parameters. (A stratified-split sketch follows the table.) |
| Hardware Specification | Yes | All experiments, training and hyper-parameter tuning are conducted on one NVIDIA RTX A6000 GPU. |
| Software Dependencies | No | The paper mentions software like "BioGPT (Luo et al., 2022)" and "stable-baselines3 (Raffin et al., 2021)" but does not provide specific version numbers for these or other ancillary software components. |
| Experiment Setup | Yes | We list the hyper-parameters in Table 8, covering both the supervised fine-tuning and reinforcement learning stages. In the RL stage, we use grid search to tune α and β to balance the trade-off between accuracy and cost. The search scope is α ∈ {1/2, 1, 2, 4, 8, 15, 16, 32, 64, 256} and β ∈ {1/10, 1, 10, 100, 1000}. (A grid-search sketch follows the table.) |
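
The Pseudocode row cites Algorithm 1 (PPO). As a minimal sketch, assuming the stable-baselines3 library the paper cites (Raffin et al., 2021), PPO training looks like the following; `CartPole-v1` is a placeholder environment, not the paper's lab-test-suggestion MDP:

```python
# Minimal PPO training sketch with stable-baselines3 (the library the paper
# cites). The environment here is a placeholder; ED-Copilot's actual MDP
# suggests laboratory tests and predicts patient outcomes.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")             # placeholder environment
model = PPO("MlpPolicy", env, verbose=0)  # PPO with clipped surrogate objective
model.learn(total_timesteps=10_000)       # run the rollout/update loop
```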
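
The Dataset Splits row describes an 80/10/10 split that preserves class distribution. A minimal sketch using scikit-learn's stratified splitting, assuming a pandas dataframe `df` with a hypothetical `outcome` label column (the released pipeline may name things differently):

```python
# Sketch of the 80/10/10 class-stratified split described in the paper.
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_split(df: pd.DataFrame, label_col: str = "outcome", seed: int = 0):
    # Hold out 80% for training while preserving the class distribution.
    train_df, rest_df = train_test_split(
        df, test_size=0.2, stratify=df[label_col], random_state=seed
    )
    # Split the remaining 20% evenly into validation and test sets.
    val_df, test_df = train_test_split(
        rest_df, test_size=0.5, stratify=rest_df[label_col], random_state=seed
    )
    return train_df, val_df, test_df
```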
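
The Experiment Setup row specifies a grid search over the reward trade-off coefficients α and β. A sketch of that loop, where `train_and_evaluate` is a hypothetical stand-in for the paper's RL training plus validation run:

```python
# Grid search over the reward trade-off coefficients alpha and beta.
from itertools import product

def train_and_evaluate(alpha: float, beta: float) -> float:
    # Hypothetical stub: train ED-Copilot's RL stage with this (alpha, beta)
    # and return a validation score balancing accuracy against time-cost.
    return 0.0

ALPHAS = [1/2, 1, 2, 4, 8, 15, 16, 32, 64, 256]  # search scope from the paper
BETAS = [1/10, 1, 10, 100, 1000]

best_score, best_config = float("-inf"), None
for alpha, beta in product(ALPHAS, BETAS):
    score = train_and_evaluate(alpha=alpha, beta=beta)
    if score > best_score:
        best_score, best_config = score, (alpha, beta)
print(f"Best (alpha, beta): {best_config}")
```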