Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Data Augmentation Can Improve Robustness
Authors: Sylvestre-Alvise Rebuffi, Sven Gowal, Dan Andrei Calian, Florian Stimberg, Olivia Wiles, Timothy A Mann
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on CIFAR-10 against ℓ∞ and ℓ2 norm-bounded perturbations of size ϵ = 8/255 and ϵ = 128/255, respectively. We show large absolute improvements of +2.93% and +2.16% in robust accuracy compared to previous state-of-the-art methods. We conduct thorough experiments to show that our approach generalizes across architectures, datasets and threat models. |
| Researcher Affiliation | Industry | Sylvestre-Alvise Rebuffi*, Sven Gowal*, Dan Calian, Florian Stimberg, Olivia Wiles and Timothy Mann DeepMind, London EMAIL |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | The code written in JAX [4] and Haiku [26] is available online at https://github.com/deepmind/deepmind-research/tree/master/adversarial_robustness. |
| Open Datasets | Yes | We evaluate our approach on CIFAR-10 against ℓ∞ and ℓ2 norm-bounded perturbations of size ϵ = 8/255 and ϵ = 128/255, respectively. We also achieve a significant performance boost with this approach while using other architectures and datasets such as CIFAR-100, SVHN and TINYIMAGENET. |
| Dataset Splits | Yes | Specifically, we train two (and only two) models for each hyperparameter setting, perform early stopping for each model on a separate validation set of 1024 samples using PGD40 similarly to Rice et al. [44] and pick the best model by evaluating the robust accuracy on the same validation set. |
| Hardware Specification | Yes | We train for 400 epochs with a batch size of 512 split over 32 Google Cloud TPU v3 cores [4], and the learning rate is initially set to 0.1 and decayed by a factor 10 two-thirds-of-the-way through training. |
| Software Dependencies | No | The paper mentions JAX [4] and Haiku [26] but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We train for 400 epochs with a batch size of 512 split over 32 Google Cloud TPU v3 cores [4], and the learning rate is initially set to 0.1 and decayed by a factor 10 two-thirds-of-the-way through training. We scale the learning rates using the linear scaling rule of Goyal et al. [21] (i.e., effective LR = max(LR × batch size/256, LR)). The decay rate of WA is set to τ = 0.999. |
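
The Experiment Setup row quotes three numeric rules: the linear learning-rate scaling rule, a tenfold decay two-thirds of the way through training, and weight averaging (WA) with decay rate τ = 0.999. The following is a minimal Python sketch of those rules under stated assumptions, not the authors' JAX/Haiku implementation. The function names are hypothetical, the exact epoch at which the decay takes effect is an assumption, and the WA update is assumed to take the common exponential-moving-average form.

```python
from typing import List


def effective_lr(base_lr: float, batch_size: int) -> float:
    """Linear scaling rule as quoted: max(LR * batch_size / 256, LR)."""
    return max(base_lr * batch_size / 256, base_lr)


def lr_at_epoch(epoch: int, total_epochs: int = 400,
                base_lr: float = 0.1, batch_size: int = 512) -> float:
    """Step schedule: divide the scaled LR by 10 after two-thirds of training."""
    lr = effective_lr(base_lr, batch_size)
    # Assumption: the drop applies from the first epoch with epoch/total >= 2/3.
    if 3 * epoch >= 2 * total_epochs:
        lr /= 10
    return lr


def wa_update(avg_params: List[float], params: List[float],
              tau: float = 0.999) -> List[float]:
    """Assumed EMA form of weight averaging: avg <- tau * avg + (1 - tau) * params."""
    return [tau * a + (1.0 - tau) * p for a, p in zip(avg_params, params)]


if __name__ == "__main__":
    print(effective_lr(0.1, 512))   # scaled LR for the quoted batch size
    print(lr_at_epoch(0), lr_at_epoch(399))
    print(wa_update([1.0], [0.0]))
```

With the quoted settings (base rate 0.1, batch size 512), the scaling rule doubles the learning rate, and the `max` keeps batch sizes below 256 from lowering it beneath the base value.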