Diagnosing Transformers: Illuminating Feature Spaces for Clinical Decision-Making
Authors: Aliyah R. Hsu, Yeshwanth Cherapanamjeri, Briton Park, Tristan Naumann, Anobel Odisho, Bin Yu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a case study investigating the impact of pre-training data where we focus on real-world pathology classification tasks, and validate our findings on MedNLI. We evaluate five 110M-sized pre-trained transformer models, categorized into general-domain (BERT, TNLR), mixed-domain (BioBERT, Clinical BioBERT), and domain-specific (PubMedBERT) groups. Our SUFO analyses reveal that: (1) while PubMedBERT, the domain-specific model, contains valuable information for fine-tuning, it can overfit to minority classes when class imbalances exist. In contrast, mixed-domain models exhibit greater resistance to overfitting, suggesting potential improvements in domain-specific model robustness; (2) in-domain pre-training accelerates feature disambiguation during fine-tuning; and (3) feature spaces undergo significant sparsification during this process, enabling clinicians to identify common outlier modes among fine-tuned models as demonstrated in this paper. (A model-loading sketch for the compared checkpoints follows the table.) |
| Researcher Affiliation | Collaboration | Aliyah R. Hsu (Department of EECS, UC Berkeley, aliyahhsu@berkeley.edu); Yeshwanth Cherapanamjeri (CSAIL, MIT); Briton Park (Department of Statistics, UC Berkeley); Tristan Naumann (Microsoft Research); Anobel Y. Odisho (Department of Urology, Epidemiology and Biostatistics, UC San Francisco); Bin Yu (Department of Statistics, EECS, Center for Computational Biology, UC Berkeley) |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our source code is available at https://github.com/adelaidehsu/path_model_evaluation |
| Open Datasets | Yes | We additionally report the fine-tuning results of the models on a publicly available clinical dataset, MedNLI (Romanov & Shivade). |
| Dataset Splits | Yes | The datasets are divided into 71% training, 18% validation, and 11% test, with label distribution in each set resembling the distribution in the full datasets. (A stratified-split sketch follows the table.) |
| Hardware Specification | Yes | Each model is fine-tuned on a single NVIDIA Tesla K80 GPU, and average fine-tuning time is around 3 hours. Each model is fine-tuned on a single NVIDIA GeForce GTX TITAN X GPU, and the fine-tuning time on average is less than 1 hour. |
| Software Dependencies | No | The paper mentions using an AdamW optimizer but does not specify version numbers for programming languages, libraries, or other software dependencies. |
| Experiment Setup | Yes | We use consistent fine-tuning hyperparameters for all models and all four tasks, as we observe the validation set performance is not very sensitive to hyperparameter selection (less than 1% F1 performance change). We use an AdamW optimizer with a 7.6 × 10⁻⁶ learning rate, 0.01 weight decay, and a 1 × 10⁻⁸ epsilon. We also adopt a linear learning rate schedule with a 0.2 warm-up ratio. We fine-tune for a maximum of 25 epochs with a batch size of 8 and evaluate every 50 steps on the validation set. (A configuration sketch follows the table.) |
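To make the model comparison concrete, here is a minimal sketch of loading the compared checkpoints for sequence-classification fine-tuning with Hugging Face Transformers. The checkpoint IDs are our assumption (the paper does not list them), and TNLR is omitted because no public checkpoint is available:

```python
# Minimal sketch: loading the pre-trained checkpoints compared in the paper
# for sequence-classification fine-tuning. The Hugging Face model IDs below
# are our assumption; TNLR has no public checkpoint and is omitted.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINTS = {
    "BERT": "bert-base-uncased",                           # general-domain
    "BioBERT": "dmis-lab/biobert-v1.1",                    # mixed-domain
    "ClinicalBioBERT": "emilyalsentzer/Bio_ClinicalBERT",  # mixed-domain
    "PubMedBERT": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",  # domain-specific
}

def load_model(name: str, num_labels: int):
    """Return (tokenizer, model) for one of the compared checkpoints."""
    ckpt = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSequenceClassification.from_pretrained(
        ckpt, num_labels=num_labels
    )
    return tokenizer, model
```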
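The reported 71/18/11 split with matching label distributions can be reproduced with stratified sampling. A minimal sketch follows; `texts` and `labels` are hypothetical placeholders, since the pathology data itself is not public:

```python
# Minimal sketch of a 71/18/11 train/validation/test split, stratified so
# that each split's label distribution resembles the full dataset's.
from sklearn.model_selection import train_test_split

def split_71_18_11(texts, labels, seed=0):
    # First carve off the 11% test set, stratified by label.
    x_rest, x_test, y_rest, y_test = train_test_split(
        texts, labels, test_size=0.11, stratify=labels, random_state=seed
    )
    # 18% of the total corresponds to 0.18 / 0.89 of the remaining 89%.
    x_train, x_val, y_train, y_val = train_test_split(
        x_rest, y_rest, test_size=0.18 / 0.89, stratify=y_rest, random_state=seed
    )
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```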
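Finally, the reported hyperparameters can be expressed as a training configuration. The paper does not name its training framework, so this sketch assumes Hugging Face `TrainingArguments`; the `output_dir` value is a placeholder:

```python
# Minimal sketch of the reported fine-tuning configuration: AdamW with
# lr = 7.6e-6, weight decay 0.01, epsilon 1e-8, a linear schedule with a
# 0.2 warm-up ratio, batch size 8, up to 25 epochs, and validation-set
# evaluation every 50 steps.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="finetune_out",          # placeholder path
    learning_rate=7.6e-6,
    weight_decay=0.01,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_ratio=0.2,
    num_train_epochs=25,
    per_device_train_batch_size=8,
    evaluation_strategy="steps",
    eval_steps=50,
)
```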