Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Towards Last-layer Retraining for Group Robustness with Fewer Annotations

Authors: Tyler LaBonte, Vidya Muthukumar, Abhishek Kumar

NeurIPS 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our empirical and theoretical results present the first evidence that model disagreement upsamples worst-group data, enabling SELF to nearly match DFR on four well-established benchmarks across vision and language tasks with no group annotations and less than 3% of the held-out class annotations.
Researcher Affiliation	Collaboration	(1) H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology; (2) School of Electrical and Computer Engineering, Georgia Institute of Technology; (3) Google DeepMind
Pseudocode	No	The paper describes methods in narrative text and does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured, code-like steps for any procedure.
Open Source Code	Yes	Our code is available at https://github.com/tmlabonte/last-layer-retraining.
Open Datasets	Yes	We study four datasets which are well-established as benchmarks for group robustness across vision and language tasks, detailed in Table 2 and summarized below. Waterbirds [71, 69, 58] is an image classification dataset... CelebA [42, 58] is an image classification dataset... CivilComments [7, 35] is a text classification dataset... MultiNLI [73, 58] is a text classification dataset...
Dataset Splits	Yes	Table 2: Dataset composition. ... Train Val Test ... Following previous work, we use half the validation set for feature reweighting [33, 28] and half for model selection with group annotations [58, 41, 33, 48, 28].
Hardware Specification	Yes	Our experiments were conducted on Nvidia Tesla V100 and A5000 GPUs.
Software Dependencies	No	The paper lists several software packages (e.g., 'NumPy [22], PyTorch [53], Lightning [72], TorchVision [44], Matplotlib [26], Transformers [74], and Milkshake [36]') but does not explicitly provide specific version numbers for these dependencies, relying instead on citations to general resources or past conference papers.
Experiment Setup	Yes	Table 11: ERM and last-layer retraining hyperparameters. We use standard hyperparameters following previous work [58, 27, 33, 28]. For last-layer retraining, we keep all hyperparameters the same except the number of epochs on CelebA, which we increase to 100.
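
The Dataset Splits row above quotes the paper as using half of the validation set for feature reweighting and half for model selection. The sketch below shows one way such a split could be made reproducible. It is an illustration only: the function name `split_validation_indices`, the seeded random shuffle, and the default seed are assumptions made for this example and are not taken from the paper or its repository, which may split the data differently.

```python
import random


def split_validation_indices(n_val, seed=0):
    """Split the indices 0..n_val-1 into two disjoint halves.

    Hypothetical helper: the random shuffle and the seed are assumptions,
    not the procedure used by the paper's authors.
    """
    indices = list(range(n_val))
    rng = random.Random(seed)  # local generator, so global state is untouched
    rng.shuffle(indices)

    half = n_val // 2
    reweighting_idx = sorted(indices[:half])  # half for feature reweighting
    selection_idx = sorted(indices[half:])    # remainder for model selection
    return reweighting_idx, selection_idx


if __name__ == "__main__":
    reweight, select = split_validation_indices(10)
    print(len(reweight), len(select))
```

Fixing the seed makes the two halves identical across runs, which matters when model selection results are compared between experiments. When the validation set has an odd number of examples, the extra example goes to the model selection half.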