Data Augmentations for Improved (Large) Language Model Generalization

Authors: Amir Feder, Yoav Wald, Claudia Shi, Suchi Saria, David Blei

NeurIPS 2023

Reproducibility Variable Result LLM Response
Research Type: Experimental
Through extensive experimentation on learning caregiver-invariant predictors of clinical diagnoses from medical narratives and on semi-synthetic data, we demonstrate that our method for simulating interventions improves out-of-distribution (OOD) accuracy compared to baseline invariant learning algorithms. We empirically study the following questions: (1) Can CATO enhance OOD performance of downstream classifiers? (2) Does it surpass the combination of reweighting and invariance penalties? (3) Is it more effective than alternative augmentation techniques, thus demonstrating the usefulness of the causal graph? (4) How sensitive is CATO to the quality of counterfactuals? These questions seek to establish causally-motivated augmentations as a practical approach for improving OOD performance. We address Q1, Q2, and Q3 through our theoretical foundation and across all empirical studies, while Q4 is explored in the synthetic experiments. Further details about the experimental setup, including data statistics, model hyperparameters, and data splits, can be found in Appendix B. Table 1 provides an overview of the tasks we experiment with.
Researcher Affiliation: Collaboration
Amir Feder (1,2), Yoav Wald (3), Claudia Shi (1), Suchi Saria (3), and David Blei (1); 1 Columbia University, 2 Google Research, 3 Johns Hopkins University
Pseudocode: Yes
Algorithm 1: CATO
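The paper's Algorithm 1 (CATO) is not reproduced on this page. As an illustration only, here is a minimal sketch of the general counterfactual-augmentation recipe the report describes (training data is extended with counterfactual copies in which a spurious attribute, such as caregiver writing style, is intervened on while the label is kept fixed). All function and field names here are hypothetical, not taken from the paper:

```python
# Hypothetical sketch of causally-motivated counterfactual augmentation.
# In the paper, counterfactual texts would come from a generator such as
# an LLM; here a trivial string rewrite stands in for that step.

def make_counterfactual(example, rewrite_fn):
    """Copy an example with the spurious attribute intervened on
    (via rewrite_fn), keeping the label unchanged."""
    return {"text": rewrite_fn(example["text"]), "label": example["label"]}

def augment_dataset(dataset, rewrite_fn):
    """Return original examples plus one counterfactual per example."""
    augmented = list(dataset)
    for ex in dataset:
        augmented.append(make_counterfactual(ex, rewrite_fn))
    return augmented

# Toy usage: str.upper stands in for a real counterfactual generator.
data = [{"text": "patient reports mild fever", "label": 1}]
augmented = augment_dataset(data, str.upper)
```

A downstream classifier would then be trained on `augmented` instead of `data`, encouraging invariance to the rewritten attribute.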
Open Source Code: No
No explicit statement about releasing code or a link to a repository for the described methodology.
Open Datasets: Yes
We utilize several electronic health records (EHR) datasets. We train on MIMIC-III [86], a widely used medical dataset containing over 2 million notes from 38,597 adult patients, 49,785 hospital admissions, and 3,500 healthcare professionals between 2001 and 2012. We use the CEBaB dataset [49], which consists of short restaurant reviews and ratings from OpenTable, including evaluations for food, service, noise, ambiance, and an overall rating.
Dataset Splits: Yes
Further details about the experimental setup, including data statistics, model hyperparameters, and data splits, can be found in Appendix B. To estimate divergences between these two distributions, we may use validation sets from our training data.
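The quoted passage mentions estimating divergences between two distributions from validation splits. As a hypothetical illustration of one such estimate (not the paper's procedure), here is a KL divergence computed between the empirical label distributions of two validation sets; all names are illustrative:

```python
import math
from collections import Counter

def label_distribution(labels):
    """Empirical distribution over discrete labels."""
    counts = Counter(labels)
    total = len(labels)
    return {k: v / total for k, v in counts.items()}

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over the union of the two supports, smoothed by eps."""
    support = set(p) | set(q)
    return sum(
        p.get(k, 0.0) * math.log((p.get(k, 0.0) + eps) / (q.get(k, eps) + eps))
        for k in support
    )

# Toy usage: labels from validation splits of two hypothetical environments.
val_env_a = [0, 0, 1, 1, 1]
val_env_b = [0, 1, 1, 1, 1]
div = kl_divergence(label_distribution(val_env_a), label_distribution(val_env_b))
```

A larger `div` would indicate a bigger shift between the two environments' label distributions; comparing a distribution with itself yields (approximately) zero.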
Hardware Specification: No
No specific hardware details (GPU models, CPU models, etc.) are mentioned in the paper.
Software Dependencies: No
While some software is mentioned (e.g., PubMedBERT, PyTorch, Hugging Face's transformers, scikit-learn, GPT-4), specific version numbers for these dependencies are not provided within the text.
Experiment Setup: Yes
Further details about the experimental setup, including data statistics, model hyperparameters, and data splits, can be found in Appendix B.