A Label Attention Model for ICD Coding from Clinical Text
Authors: Thanh Vu, Dat Quoc Nguyen, Anthony Nguyen
IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As our final contribution, we extensively evaluate our models on three standard benchmark MIMIC datasets [Lee et al., 2011; Johnson et al., 2016], which are widely used in automatic ICD coding research [Perotte et al., 2013; Prakash et al., 2017; Mullenbach et al., 2018; Xie et al., 2019; Li and Yu, 2020]. Experimental results show that our model obtains the new SOTA performance results across evaluation metrics. In addition, our joint learning mechanism helps improve the performances for infrequent codes. |
| Researcher Affiliation | Collaboration | ¹ Australian e-Health Research Centre, CSIRO, Brisbane, Australia; ² VinAI Research, Hanoi, Vietnam |
| Pseudocode | No | The paper presents model architectures with diagrams and mathematical equations, but it does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any specific links or statements about the availability of open-source code for the described methodology. |
| Open Datasets | Yes | We follow recent SOTA work on ICD coding from clinical text [Mullenbach et al., 2018; Xie et al., 2019; Li and Yu, 2020]: using benchmark Medical Information Mart for Intensive Care (MIMIC) datasets MIMIC-III [Johnson et al., 2016] and MIMIC-II [Lee et al., 2011]. |
| Dataset Splits | Yes | For the first experiment of using the full set of codes, the data was split using patient ID so that no patient is appearing in both training and validation/test sets. In particular, there are 47,719 discharge summaries for training, 1,631 for validation and 3,372 for testing. For the second experiment of using the 50 most frequent codes, the resulting subset of 11,317 discharge summaries was obtained, in which there are 8,067 discharge summaries for training, 1,574 for validation and 1,730 for testing. Following the previous work [Perotte et al., 2013; Mullenbach et al., 2018; Li and Yu, 2020], 20,533 and 2,282 clinical notes were used for training and testing, respectively (with a total of 5,031 unique codes). From the set of 20,533 clinical notes, we further use 1,141 notes for validation, resulting in only 19,392 notes for training our model. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU model, CPU type) used to run the experiments. It only states "We implement our LAAT and Joint LAAT using PyTorch". |
| Software Dependencies | No | The paper mentions "PyTorch [Paszke et al., 2019]" and "AdamW [Loshchilov and Hutter, 2019]" but does not specify version numbers for these software components, which is required for reproducibility. |
| Experiment Setup | Yes | We implement our LAAT and Joint LAAT using PyTorch [Paszke et al., 2019]. We train the models with AdamW [Loshchilov and Hutter, 2019], and set its learning rate to the default value of 0.001. The batch size and number of epochs are set to 8 and 50, respectively. We use a learning rate scheduler to automatically reduce the learning rate by 10% if there is no improvement in every 5 epochs. We also implement an early stopping mechanism, in which the training is stopped if there is no improvement of the micro-averaged F1 score on the validation set in 6 continuous epochs. For both LAAT and Joint LAAT, we apply a dropout mechanism with the dropout probability of 0.3. For LAAT, we perform a grid search over the LSTM hidden size u ∈ {128, 256, 384, 512} and the projection size da ∈ {128, 256, 384, 512}, resulting in the optimal values u at 512 and da at 512 on the MIMIC-III-full dataset, and the optimal values u at 256 and da at 256 on both the MIMIC-III-50 and MIMIC-II-full datasets. For Joint LAAT, we employ the optimal hyper-parameters (da and u) from LAAT and fix the projection size p at 128. |
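
To make the quoted training setup concrete, the sketch below wires the reported hyper-parameters together in PyTorch. It is an illustrative reconstruction, not the authors' released implementation: the placeholder model, the `validation_micro_f1` stub, and the loop structure are hypothetical, and only the numeric settings (AdamW at learning rate 0.001, batch size 8, up to 50 epochs, a 10% learning-rate reduction after 5 epochs without improvement, early stopping after 6 epochs without validation micro-F1 improvement, dropout 0.3) come from the quote above.

```python
# Hedged sketch of the reported training configuration (not the authors' LAAT code).
# Only the hyper-parameter values are taken from the paper as quoted in the table.
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Placeholder model standing in for LAAT; dropout probability 0.3 as reported.
model = nn.Sequential(nn.Linear(512, 512), nn.Dropout(p=0.3), nn.Linear(512, 50))

# AdamW with the default learning rate of 0.001.
optimizer = AdamW(model.parameters(), lr=0.001)

# "Reduce the learning rate by 10%" -> multiply by 0.9 after 5 epochs with no improvement.
scheduler = ReduceLROnPlateau(optimizer, mode="max", factor=0.9, patience=5)

def validation_micro_f1(model):
    # Hypothetical stub: evaluate micro-averaged F1 on the validation set.
    return 0.0

best_f1, epochs_since_best = 0.0, 0
for epoch in range(50):                      # at most 50 epochs
    # ... one training pass over the training set with batch size 8 ...
    f1 = validation_micro_f1(model)
    scheduler.step(f1)                       # scheduler tracks the validation metric
    if f1 > best_f1:
        best_f1, epochs_since_best = f1, 0
    else:
        epochs_since_best += 1
        if epochs_since_best >= 6:           # early stopping after 6 epochs without improvement
            break
```

Using `ReduceLROnPlateau` with `mode="max"` is one natural reading of the quoted scheduler description, since the monitored quantity (validation micro-F1) is to be maximized; the paper itself does not name a specific scheduler class.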