Towards Last-layer Retraining for Group Robustness with Fewer Annotations

Authors: Tyler LaBonte, Vidya Muthukumar, Abhishek Kumar

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical and theoretical results present the first evidence that model disagreement upsamples worst-group data, enabling SELF to nearly match DFR on four well-established benchmarks across vision and language tasks with no group annotations and less than 3% of the held-out class annotations.
Researcher Affiliation | Collaboration | (1) H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology; (2) School of Electrical and Computer Engineering, Georgia Institute of Technology; (3) Google DeepMind
Pseudocode | No | The paper describes methods in narrative text and does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured, code-like steps for any procedure.
Open Source Code | Yes | Our code is available at https://github.com/tmlabonte/last-layer-retraining.
Open Datasets | Yes | We study four datasets which are well-established as benchmarks for group robustness across vision and language tasks, detailed in Table 2 and summarized below. Waterbirds [71, 69, 58] is an image classification dataset... CelebA [42, 58] is an image classification dataset... CivilComments [7, 35] is a text classification dataset... MultiNLI [73, 58] is a text classification dataset...
Dataset Splits | Yes | Table 2: Dataset composition. ... Train Val Test ... Following previous work, we use half the validation set for feature reweighting [33, 28] and half for model selection with group annotations [58, 41, 33, 48, 28].
Hardware Specification | Yes | Our experiments were conducted on Nvidia Tesla V100 and A5000 GPUs.
Software Dependencies | No | The paper lists several software packages (e.g., 'NumPy [22], PyTorch [53], Lightning [72], TorchVision [44], Matplotlib [26], Transformers [74], and Milkshake [36]') but does not explicitly provide specific version numbers for these dependencies, relying instead on citations to general resources or past conference papers.
Experiment Setup | Yes | Table 11: ERM and last-layer retraining hyperparameters. We use standard hyperparameters following previous work [58, 27, 33, 28]. For last-layer retraining, we keep all hyperparameters the same except the number of epochs on CelebA, which we increase to 100.
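
For context on the disagreement-based selection quoted in the Research Type row, the following is a minimal PyTorch sketch of the general idea only, not the authors' implementation (see their repository above for that). The model and loader names are hypothetical, and the helper assumes an unshuffled loader over the held-out split so that batch offsets map back to dataset indices.

    import torch

    def select_disagreement_set(model_a, model_b, loader, device="cpu"):
        """Return indices of held-out points where two models' predictions disagree.

        Sketch of disagreement-based selection: points on which an ERM model and a
        second variant of it disagree tend to over-represent worst-group data, so
        they can serve as the reweighting set for last-layer finetuning.
        Assumes `loader` iterates over (inputs, labels) batches without shuffling.
        """
        model_a.eval()
        model_b.eval()
        disagreement_indices = []
        offset = 0
        with torch.no_grad():
            for inputs, _ in loader:
                inputs = inputs.to(device)
                preds_a = model_a(inputs).argmax(dim=1)
                preds_b = model_b(inputs).argmax(dim=1)
                mask = preds_a != preds_b
                disagreement_indices.extend(
                    (offset + torch.nonzero(mask).flatten()).tolist()
                )
                offset += inputs.size(0)
        return disagreement_indices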
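
Similarly, the Dataset Splits and Experiment Setup rows refer to last-layer retraining on a held-out reweighting split. The sketch below illustrates the generic recipe, assuming the classifier head is exposed as model.fc (as in torchvision ResNets); the optimizer choice and hyperparameter values are placeholders, not the paper's settings from Table 11.

    import torch
    from torch import nn

    def retrain_last_layer(model, reweight_loader, num_epochs=100, lr=1e-3, device="cpu"):
        """Retrain only the final linear layer of a pretrained classifier.

        Minimal sketch of last-layer retraining: all backbone parameters are frozen
        and only the classification head (assumed to be `model.fc`) is optimized on
        the held-out reweighting split.
        """
        for param in model.parameters():
            param.requires_grad = False
        for param in model.fc.parameters():
            param.requires_grad = True

        optimizer = torch.optim.SGD(model.fc.parameters(), lr=lr, momentum=0.9)
        criterion = nn.CrossEntropyLoss()
        model.to(device).train()
        for _ in range(num_epochs):
            for inputs, targets in reweight_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                optimizer.zero_grad()
                loss = criterion(model(inputs), targets)
                loss.backward()
                optimizer.step()
        return model

In practice, the reweighting loader could be built from half of the validation set (e.g., via torch.utils.data.random_split), with the remaining half reserved for model selection, matching the split described in the Dataset Splits row.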