Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
A Group-Theoretic Framework for Data Augmentation
Authors: Shuxiao Chen, Edgar Dobriban, Jane H. Lee
JMLR 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | See Figure 1 for a small experiment (see Appendix D for details). ... Figure 1: Benefits of data augmentation: A comparison of test accuracy across training epochs of ResNet18 (He et al., 2016) (1) without data augmentation, (2) horizontally flipping the image with 0.5 probability, and (3) a composition of randomly cropping a 32 × 32 portion of the image and random horizontal flip. The experiment is repeated 15 times, with the dotted lines showing the average test accuracy and the shaded regions representing 1 standard deviation around the mean. |
| Researcher Affiliation | Academia | Shuxiao Chen EMAIL Edgar Dobriban EMAIL Department of Statistics The Wharton School of the University of Pennsylvania Philadelphia, PA, 19104-6340, USA Jane H. Lee EMAIL Department of Mathematics, and Computer and Information Science University of Pennsylvania Philadelphia, PA 19104-6309, USA |
| Pseudocode | Yes | Algorithm 1: Augmented SGD. Input: Data X_i, i = 1, …, n; Method to compute gradients ∇L(θ, X) of the loss; Method to sample augmentations g ∈ G, g ∼ Q; Learning rates η_t; Batch sizes \|S_t\|; Initial parameters θ_0; Stopping criterion. Output: Final parameters. Set t = 0. While stopping criterion is not met: Sample random minibatch S_t ⊆ {1, …, n}; Sample random augmentation g_{i,t} ∼ Q for each batch element; Update parameters θ_{t+1} ← θ_t − (η_t / \|S_t\|) Σ_{i ∈ S_t} ∇L(θ_t, g_{i,t} X_i); t ← t + 1. Return θ_t. |
| Open Source Code | Yes | Our code is available at https://github.com/dobriban/data_aug. |
| Open Datasets | Yes | We train ResNet18 (He et al., 2016) on CIFAR10 (Krizhevsky, 2009)... The CIFAR10 dataset is standard and can be downloaded from https://www.cs.toronto.edu/~kriz/cifar.html. |
| Dataset Splits | Yes | The left graph shows results from training on the full CIFAR10 training data and the right uses half of the training data as that of the left. |
| Hardware Specification | Yes | This experiment was done on a p3.2xlarge (GPU) instance on Amazon Web Services (AWS). |
| Software Dependencies | No | The paper mentions software components like "ResNet18" and "pytorch-cifar" but does not specify version numbers for these or other software libraries used. |
| Experiment Setup | Yes | We use the default settings from that code, including the SGD optimizer with a learning rate of 0.1, momentum 0.9, weight decay 5 × 10⁻⁴, and batch size of 128. |
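
The Augmented SGD procedure quoted in the Pseudocode row can be illustrated with a minimal sketch. This is not the authors' implementation (their code is at the repository linked above); it is a toy, standard-library-only version under stated assumptions: a noiseless least-squares problem, a two-element group that reverses the two input coordinates, and a constant learning rate with plain SGD (no momentum or weight decay). All function names (`grad_squared_loss`, `sample_flip`, `augmented_sgd`) are hypothetical.

```python
import random


def grad_squared_loss(theta, example):
    """Gradient of 0.5 * (theta . x - y)^2 with respect to theta."""
    x, y = example
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    return [residual * xi for xi in x]


def sample_flip(rng):
    """Sample g ~ Q, with Q uniform on the group {identity, reverse coordinates}."""
    reverse = rng.random() < 0.5

    def g(example):
        x, y = example
        return (x[::-1], y) if reverse else (x, y)

    return g


def augmented_sgd(data, grad, sample_aug, theta0, lr, batch_size, num_steps, seed=0):
    """Toy augmented SGD: each minibatch element gets its own random augmentation."""
    rng = random.Random(seed)
    theta = list(theta0)
    for t in range(num_steps):  # assumed stopping criterion: fixed step count
        batch = rng.sample(range(len(data)), batch_size)  # minibatch S_t
        total = [0.0] * len(theta)
        for i in batch:
            g = sample_aug(rng)  # g_{i,t} ~ Q
            grad_i = grad(theta, g(data[i]))  # gradient at the augmented point
            total = [a + b for a, b in zip(total, grad_i)]
        # theta <- theta - (eta_t / |S_t|) * sum of per-example gradients
        theta = [th - lr(t) / batch_size * s for th, s in zip(theta, total)]
    return theta


# Toy data: y = x1 + x2, which is invariant under reversing the coordinates.
data_rng = random.Random(1)
data = []
for _ in range(200):
    x = [data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)]
    data.append((x, x[0] + x[1]))

theta = augmented_sgd(
    data, grad_squared_loss, sample_flip,
    theta0=[0.0, 0.0], lr=lambda t: 0.1, batch_size=16, num_steps=2000,
)
print(theta)  # should be close to [1, 1] for this noiseless toy problem
```

Because the label is invariant under the chosen group action, every augmented example is still consistent with the true parameters, so the augmentation only adds training points. The CIFAR10 experiment described in the table uses image flips and crops instead.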