Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Towards Last-layer Retraining for Group Robustness with Fewer Annotations
Authors: Tyler LaBonte, Vidya Muthukumar, Abhishek Kumar
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical and theoretical results present the first evidence that model disagreement upsamples worst-group data, enabling SELF to nearly match DFR on four well-established benchmarks across vision and language tasks with no group annotations and less than 3% of the held-out class annotations. |
| Researcher Affiliation | Collaboration | ¹H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology; ²School of Electrical and Computer Engineering, Georgia Institute of Technology; ³Google DeepMind |
| Pseudocode | No | The paper describes methods in narrative text and does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured, code-like steps for any procedure. |
| Open Source Code | Yes | Our code is available at https://github.com/tmlabonte/last-layer-retraining. |
| Open Datasets | Yes | We study four datasets which are well-established as benchmarks for group robustness across vision and language tasks, detailed in Table 2 and summarized below. Waterbirds [71, 69, 58] is an image classification dataset... CelebA [42, 58] is an image classification dataset... CivilComments [7, 35] is a text classification dataset... MultiNLI [73, 58] is a text classification dataset... |
| Dataset Splits | Yes | Table 2: Dataset composition. ... Train Val Test ... Following previous work, we use half the validation set for feature reweighting [33, 28] and half for model selection with group annotations [58, 41, 33, 48, 28]. |
| Hardware Specification | Yes | Our experiments were conducted on Nvidia Tesla V100 and A5000 GPUs. |
| Software Dependencies | No | The paper lists several software packages (e.g., 'NumPy [22], PyTorch [53], Lightning [72], TorchVision [44], Matplotlib [26], Transformers [74], and Milkshake [36]') but does not provide version numbers for these dependencies, relying instead on citations to general resources or past conference papers. |
| Experiment Setup | Yes | Table 11: ERM and last-layer retraining hyperparameters. We use standard hyperparameters following previous work [58, 27, 33, 28]. For last-layer retraining, we keep all hyperparameters the same except the number of epochs on CelebA, which we increase to 100. |
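
The Dataset Splits row describes using half of the validation set for feature reweighting and half for model selection. The sketch below illustrates one way such a split could be made. It is not taken from the paper or its repository: the function name `split_validation`, the random shuffling, and the fixed seed are all assumptions made for illustration, and the authors' code may partition the data differently.

```python
import random


def split_validation(indices, seed=0):
    """Split validation indices into two disjoint halves.

    Hypothetical helper: the shuffle and the seed are assumptions,
    not details reported in the paper.
    """
    shuffled = list(indices)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    # First half: feature reweighting. Second half: model selection.
    return shuffled[:mid], shuffled[mid:]


# Toy example with ten validation indices.
reweighting_idx, selection_idx = split_validation(range(10))
```

Fixing the seed means both halves can be regenerated exactly, which matters when model selection results are compared across runs.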