Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Evaluating Robustness to Dataset Shift via Parametric Robustness Sets
Authors: Nikolaj Thams, Michael Oberst, David Sontag
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes. In a computer vision task, we find that this approach finds more impactful shifts than a reweighting approach, while taking far less time to compute, and that the resulting estimates of accuracy are substantially more reliable (see Section 4). We simulate $K = 100$ validation sets from $P$, in each estimating the worst-case shifts $\delta_{\text{Taylor}}$ (via the approach in Section 3.3) and $\delta_{\text{IS}}$, where the latter corresponds to minimizing $\hat{E}_{\delta,\text{IS}}$ using a standard non-convex solver from the scipy library [Virtanen et al., 2020]. We simulate ground truth data from $P_{\delta_{\text{IS}}}$ and $P_{\delta_{\text{Taylor}}}$, to compare the two shifts. |
| Researcher Affiliation | Academia | Nikolaj Thams (Dept. of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark); Michael Oberst (CSAIL & IMES, MIT, Cambridge, MA); David Sontag (CSAIL & IMES, MIT, Cambridge, MA) |
| Pseudocode | No | The paper provides mathematical formulations and descriptions of the approach, but does not include any explicit pseudocode blocks or sections labeled 'Algorithm'. |
| Open Source Code | Yes | Code is available at this link. |
| Open Datasets | Yes | To illustrate this use-case, we make use of the CelebA dataset [Liu et al., 2015], which contains images of faces and binary attributes (e.g., glasses, beard, etc.) encoding several features whose correlations may be unstable (e.g., the relation between gender and being bald). |
| Dataset Splits | No | We simulate $K = 100$ validation sets from $P$, in each estimating the worst-case shifts $\delta_{\text{Taylor}}$ (via the approach in Section 3.3) and $\delta_{\text{IS}}$, where the latter corresponds to minimizing $\hat{E}_{\delta,\text{IS}}$ using a standard non-convex solver from the scipy library [Virtanen et al., 2020]. |
| Hardware Specification | No | The paper mentions finetuning a ResNet50 classifier but does not provide specific hardware details like GPU/CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions using a 'standard non-convex solver from the scipy library' and 'finetuning a pretrained ResNet50 classifier', but it does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | No | The paper states that it uses a 'ResNet50 classifier' and '0/1 loss', and gives constraints such as '$\lVert \delta \rVert_2 \leq \lambda = 2$', but it does not provide specific experimental setup details such as learning rates, batch sizes, optimizers, or training schedules. |
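
The responses above mention searching for a worst-case shift by optimizing an importance-sampling estimate with a standard non-convex solver from scipy, subject to a norm constraint on the shift parameter. The following is a minimal sketch of what such a search could look like, not the authors' implementation. Everything specific in it is an assumption made for illustration: the data are a toy Gaussian sample, the shift is a mean shift of that Gaussian, the 0/1 loss is synthetic, the solver is SLSQP, and the names `is_estimate`, `lam`, and `delta_is` are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d = 2          # dimension of the shift parameter (assumed)
lam = 2.0      # radius of the L2 ball on delta (assumed to mirror the constraint quoted above)

# Toy validation sample from P = N(0, I); stands in for real validation data.
X = rng.standard_normal((5000, d))

# Synthetic 0/1 loss: the hypothetical model errs when this linear score exceeds 1.
loss = (X[:, 0] + 0.5 * X[:, 1] > 1.0).astype(float)


def is_estimate(delta):
    """Importance-sampling estimate of the expected loss under N(delta, I)."""
    # Log density ratio of N(delta, I) to N(0, I), evaluated at each sample.
    log_w = X @ delta - 0.5 * delta @ delta
    return np.mean(np.exp(log_w) * loss)


# SLSQP inequality constraints require fun(delta) >= 0, i.e. ||delta||_2 <= lam.
constraint = {"type": "ineq", "fun": lambda delta: lam**2 - delta @ delta}

# Maximize the estimated loss by minimizing its negative.
result = minimize(
    lambda delta: -is_estimate(delta),
    x0=np.full(d, 0.1),
    method="SLSQP",
    constraints=[constraint],
)
delta_is = result.x

print("estimated loss with no shift:", is_estimate(np.zeros(d)))
print("estimated loss at found shift:", is_estimate(delta_is))
print("norm of found shift:", np.linalg.norm(delta_is))
```

Because the objective is non-convex in general, a local solver like this may return different shifts from different starting points, and importance weights can become highly variable for large shifts. Both effects are consistent with the reliability concerns about reweighting noted in the Research Type row.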