Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Data Augmentation as Feature Manipulation
Authors: Ruoqi Shen, Sebastien Bubeck, Suriya Gunasekar
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our main contribution is a detailed analysis of data augmentation on the learning dynamic for a two layer convolutional neural network in the recently proposed multi-view data model by Allen-Zhu & Li (2020b). We complement this analysis with further experimental evidence that data augmentation can be viewed as feature manipulation. |
| Researcher Affiliation | Collaboration | ¹University of Washington. Part of this work was done as an intern at Microsoft Research. ²Microsoft Research. |
| Pseudocode | No | The paper describes algorithms and derivations using mathematical formulas and prose, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We complement our analysis with experiments on CIFAR-10 and synthetic datasets, where we study data augmentation in more generality. |
| Dataset Splits | No | The paper mentions 'training examples' and 'test dataset' (e.g., 'full CIFAR-10 dataset which has 50000 training examples for 10 classes' and 'We use the standard CIFAR-10 test dataset') but does not specify exact training/validation/test split percentages or counts for reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions training networks (ResNet20) and using SGD, but it does not specify any software dependencies with version numbers (e.g., PyTorch, TensorFlow versions, or specific library versions). |
| Experiment Setup | Yes | In all configurations, we train a ResNet20 network using SGD for 120 epochs with momentum 0.9, weight decay 0.005, and learning rate starting at 0.1 and annealed to (0.01, 0.001) at epochs (40, 80). |
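
The training schedule quoted in the Experiment Setup row can be sketched in plain Python. This is a minimal illustration, not the authors' code (the report finds no released source). The names `learning_rate` and `LR_STEPS`, the 0-indexed epoch convention, and the reading of "annealed to (0.01, 0.001) at epochs (40, 80)" as a piecewise-constant step schedule are all assumptions.

```python
# Hyperparameters as quoted in the Experiment Setup row.
EPOCHS = 120
MOMENTUM = 0.9
WEIGHT_DECAY = 0.005

# (first epoch at which the rate applies, learning rate).
# Assumption: epochs are 0-indexed and the rate changes exactly at 40 and 80.
LR_STEPS = [(0, 0.1), (40, 0.01), (80, 0.001)]


def learning_rate(epoch: int) -> float:
    """Return the step-schedule learning rate for a given epoch."""
    if not 0 <= epoch < EPOCHS:
        raise ValueError(f"epoch must be in [0, {EPOCHS}), got {epoch}")
    rate = LR_STEPS[0][1]
    for start, value in LR_STEPS:
        # Keep the rate of the latest step that has already started.
        if epoch >= start:
            rate = value
    return rate


if __name__ == "__main__":
    # Print the rate on either side of each boundary.
    for epoch in (0, 39, 40, 79, 80, 119):
        print(epoch, learning_rate(epoch))
```

In a deep learning framework the same values would typically be passed to an SGD optimizer together with a milestone-based scheduler. The paper does not name the framework it used, so none is assumed here.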