Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Multitask Learning Can Improve Worst-Group Outcomes

Authors: Atharva Kulkarni, Lucio M. Dery, Amrith Setlur, Aditi Raghunathan, Ameet Talwalkar, Graham Neubig

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We run a large number of fine-tuning experiments across computer vision and natural language processing datasets and find that our regularized MTL approach consistently outperforms JTT on both average and worst-group outcomes.
Researcher Affiliation | Academia | Atharva Kulkarni EMAIL Language Technologies Institute, School of Computer Science, Carnegie Mellon University; Lucio M. Dery EMAIL Computer Science Department, School of Computer Science, Carnegie Mellon University; Amrith Setlur EMAIL Machine Learning Department, School of Computer Science, Carnegie Mellon University; Aditi Raghunathan EMAIL Computer Science Department, School of Computer Science, Carnegie Mellon University; Ameet Talwalkar EMAIL Machine Learning Department, School of Computer Science, Carnegie Mellon University; Graham Neubig EMAIL Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Pseudocode | No | The paper includes mathematical equations for loss functions and model parameterizations (e.g., Equations 4, 5, 7, 8, and 9) but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our official code can be found here: https://github.com/atharvajk98/MTL-group-robustness.
Open Datasets | Yes | 1. Waterbirds: This image classification dataset was introduced by Sagawa et al. (2020a). 2. MultiNLI: This is a natural language inference dataset... (Williams et al., 2018). 3. CivilComments: The CivilComments dataset is a toxicity classification dataset... Borkan et al. (2019); Koh et al. (2021).
Dataset Splits | Yes | CivilComments-small: Our subset contains 13,770, 2,039, and 4,866 datapoints in our train, validation, and test splits, respectively.
Hardware Specification | No | The paper mentions training parameters such as learning rates and batch sizes, and model architectures (BERT-base, ViT-base), but does not specify the hardware used for the experiments (e.g., GPU or CPU models, or cloud computing resources).
Software Dependencies | No | The paper mentions using optimizers such as Adam and SGD, and pre-trained models such as BERT-base and ViT-base, but does not specify software library versions (e.g., PyTorch or TensorFlow versions) or other software dependencies with version numbers.
Experiment Setup | Yes | For training, we vary the fine-tuning learning rate within {10^-3, 10^-4} for Waterbirds and {10^-4, 10^-5} for the text datasets. We experiment with batch sizes in {4, 8, 16, 32}, using the same batch size for T_end and T_aux. We train for 50 epochs on the NLP datasets and 200 epochs on Waterbirds, with an early-stopping patience of 10, as per the checkpointing scheme explained in Section 4.2. We use the Adam optimizer for the NLP datasets with decoupled weight decay regularization of 10^-2 (Loshchilov & Hutter, 2017). Consistent with recent studies on ViT (Dosovitskiy et al., 2020; Steiner et al., 2022), we use SGD with a momentum of 0.9 (Sutskever et al., 2013) to fine-tune on Waterbirds.
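The quoted experiment setup can be sketched as a small hyperparameter sweep. The values below (learning rates, batch sizes, epochs, optimizer settings) come from the quoted text; the helper names `optimizer_config` and `hyperparameter_grid` and the sweep structure are illustrative assumptions, not the authors' actual code:

```python
from itertools import product

# Learning-rate ranges quoted in the paper's setup: {10^-3, 10^-4} for
# Waterbirds, {10^-4, 10^-5} for the text (NLP) datasets.
LEARNING_RATES = {
    "waterbirds": [1e-3, 1e-4],
    "multinli": [1e-4, 1e-5],
    "civilcomments": [1e-4, 1e-5],
}

# Batch sizes tried: the same batch size is used for T_end and T_aux.
BATCH_SIZES = [4, 8, 16, 32]


def optimizer_config(dataset: str) -> dict:
    """Adam with decoupled weight decay (1e-2) and 50 epochs for NLP datasets;
    SGD with momentum 0.9 and 200 epochs for Waterbirds. Early stopping uses
    a patience of 10 in both cases."""
    if dataset == "waterbirds":
        return {"optimizer": "SGD", "momentum": 0.9, "epochs": 200, "patience": 10}
    return {"optimizer": "AdamW", "weight_decay": 1e-2, "epochs": 50, "patience": 10}


def hyperparameter_grid(dataset: str) -> list[dict]:
    """All (learning rate, batch size) combinations swept for a dataset."""
    return [
        {"lr": lr, "batch_size": bs, **optimizer_config(dataset)}
        for lr, bs in product(LEARNING_RATES[dataset], BATCH_SIZES)
    ]
```

With two learning rates and four batch sizes per dataset, each sweep covers eight configurations before model selection via the checkpointing scheme in Section 4.2.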