Multi-Label Few-Shot ICD Coding as Autoregressive Generation with Prompt

Authors: Zhichao Yang, Sunjae Kwon, Zonghai Yao, Hong Yu

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our Generation with Prompt (GPsoap) model on the all-code assignment benchmark (MIMIC-III-full) and the few-shot ICD code assignment benchmark (MIMIC-III-few). Experiments on MIMIC-III-few show that our model achieves a macro F1 of 30.2, which substantially outperforms the previous MIMIC-III-full SOTA model (macro F1 4.3) and the model specifically designed for the few/zero-shot setting (macro F1 18.7). (A macro-F1 sketch for multi-label coding follows the table.)
Researcher Affiliation | Academia | 1 College of Information and Computer Sciences, University of Massachusetts Amherst; 2 Department of Computer Science, University of Massachusetts Lowell; 3 Center for Healthcare Organization and Implementation Research, Veterans Affairs Bedford Healthcare System; zhichaoyang@umass.edu
Pseudocode | No | The paper describes the methods used but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our codes are attached in supplementary material and will be publicly available upon publication. Our evaluation code is publicly available at https://github.com/whaleloops/KEPT.
Open Datasets | Yes | The fine-tuning dataset (Johnson et al. 2016) contains clinical data from real patients. It contains data instances of de-identified discharge summary note texts with expert-labeled ICD-9 codes. ... MIMIC-III, a freely accessible critical care database (Johnson et al. 2016).
Dataset Splits | Yes | For all codes prediction tasks (MIMIC-III-full), we used the same splits as the previous work (Mullenbach et al. 2018; Yuan, Tan, and Huang 2022).
Hardware Specification | Yes | Pretraining on SOAP data took about 140 hours on 4 NVIDIA RTX 6000 GPUs with 24 GB memory. Fine-tuning took about 40 hours on 4 NVIDIA RTX 6000 GPUs with 24 GB memory. Our reranker training took about 12 hours on 2 NVIDIA A100 GPUs with 40 GB memory.
Software Dependencies | No | The paper details hyper-parameters and training configurations such as learning rates and dropout rates, but it does not specify versions for software dependencies like programming languages or libraries (e.g., Python, PyTorch).
Experiment Setup | Yes | During pretraining, we used a warmup ratio of 0.1, learning rate 5e-5, dropout rate 0.1, L2 weight decay 1e-3, and a batch size of 64 with fp16. During fine-tuning, we grid-searched learning rate [1e-5, 2e-5, 3e-5] and dropout rate [0.1, 0.3, 0.5] with a batch size of 4. The best hyper-parameter set is bolded in the paper. The random seed is 42. (An illustrative mapping of these settings onto a training configuration follows the table.)
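
For readers unfamiliar with the headline metric quoted in the Research Type row, the following is a minimal sketch of how macro F1 is typically computed for multi-label ICD code assignment. It is not the authors' evaluation code (that lives in the KEPT repository); the label matrices and their shapes are illustrative assumptions only.

```python
import numpy as np
from sklearn.metrics import f1_score

# Multi-hot label matrices: rows = discharge summaries, columns = ICD codes.
# Toy values for illustration only, not data from MIMIC-III.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 1],
                   [1, 1, 0, 1]])

# Macro F1 averages the per-code F1 scores, so rare (few-shot) codes count
# as much as frequent ones -- which is why it is the headline metric on
# MIMIC-III-few, where head codes cannot dominate the average.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
micro_f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)
print(f"macro F1: {macro_f1:.3f}, micro F1: {micro_f1:.3f}")
```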
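
The quoted experiment setup maps naturally onto a standard Transformer training configuration. Because the paper does not name its training framework or library versions (see the Software Dependencies row), the sketch below uses Hugging Face `TrainingArguments` purely as an assumed vehicle; the per-device batch size and the placement of dropout on the model config are illustrative assumptions, not reported details.

```python
# Sketch of how the quoted hyper-parameters could be expressed with
# Hugging Face TrainingArguments. This is an assumption about tooling,
# not the authors' actual configuration.
from itertools import product
from transformers import TrainingArguments

pretrain_args = TrainingArguments(
    output_dir="gpsoap-pretrain",
    warmup_ratio=0.1,                # warmup ratio 0.1
    learning_rate=5e-5,              # learning rate 5e-5
    weight_decay=1e-3,               # L2 weight decay 1e-3
    per_device_train_batch_size=16,  # assumes batch size 64 split over the 4 GPUs above
    fp16=True,                       # mixed-precision training
    seed=42,                         # fixed random seed
)

# Fine-tuning grid search over learning rate and dropout, batch size 4.
for lr, dropout in product([1e-5, 2e-5, 3e-5], [0.1, 0.3, 0.5]):
    finetune_args = TrainingArguments(
        output_dir=f"gpsoap-finetune-lr{lr}-do{dropout}",
        learning_rate=lr,
        per_device_train_batch_size=4,
        seed=42,
    )
    # Dropout is typically set on the model config rather than on
    # TrainingArguments, e.g. via AutoConfig.from_pretrained(..., dropout=dropout).
```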