Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
How Data Augmentation affects Optimization for Linear Regression
Authors: Boris Hanin, Yi Sun
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate Theorems 4.1 and 4.2, we ran augmented GD and SGD with additive Gaussian noise on N = 100 simulated datapoints. ... Figure 4.1 shows MSE and ‖W_t‖_F along a single optimization trajectory with different schedules for the variance σ_t^2 used in Gaussian noise augmentation. |
| Researcher Affiliation | Academia | Boris Hanin Department of Operations Research and Financial Engineering Princeton University EMAIL; Yi Sun Department of Statistics University of Chicago EMAIL |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Complete code to generate this figure is provided in supplement.zip in the supplement. |
| Open Datasets | No | The paper states 'N = 100 simulated datapoints' and 'Inputs were i.i.d. Gaussian vectors in dimension n = 400', indicating the data was generated for the experiments rather than being a publicly accessible dataset with concrete access information. |
| Dataset Splits | No | The paper mentions running experiments on 'simulated datapoints' but does not provide specific details about training, validation, or test dataset splits, percentages, or sample counts. |
| Hardware Specification | No | The paper mentions 'It ran in 30 minutes on a standard laptop CPU.' This is a general statement and does not provide specific hardware details such as CPU model, GPU models, or memory. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies used in the experiments (e.g., Python, PyTorch, TensorFlow, etc.). |
| Experiment Setup | Yes | The learning rate followed a fixed polynomially decaying schedule η_t = 0.005 / (100 * (batch size)) / (1 + t / 20)^0.66, and the batch size used for SGD was 20. |
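
The setup quoted in the table (N = 100 simulated datapoints, Gaussian inputs in dimension n = 400, batch size 20, polynomially decaying learning rate) can be sketched roughly as follows. This is a minimal illustration, not the authors' code from supplement.zip. The learning rate expression is evaluated left to right exactly as printed above, which may not match the grouping intended in the paper. The squared-error loss, the single-output targets, the zero initialization, the step count, and the `sigma_schedule` used here are all assumptions made for the example; the names `learning_rate` and `augmented_sgd` are hypothetical.

```python
import numpy as np


def learning_rate(t, batch_size=20, base=0.005, decay_scale=20.0, power=0.66):
    """Schedule as printed in the table, evaluated left to right (assumed grouping)."""
    return base / (100 * batch_size) / (1 + t / decay_scale) ** power


def augmented_sgd(X, Y, sigma_schedule, steps=200, batch_size=20, seed=0):
    """SGD for linear regression with fresh additive Gaussian input noise each step.

    Assumes a mean squared-error loss and zero initialization; neither is
    stated in the table.
    """
    rng = np.random.default_rng(seed)
    N, n = X.shape
    W = np.zeros((n, Y.shape[1]))
    for t in range(steps):
        # Sample a minibatch without replacement.
        idx = rng.choice(N, size=batch_size, replace=False)
        # Gaussian noise augmentation with standard deviation sigma_schedule(t).
        Xb = X[idx] + sigma_schedule(t) * rng.standard_normal((batch_size, n))
        # Gradient of (1 / (2 * batch_size)) * ||Xb @ W - Yb||_F^2 with respect to W.
        grad = Xb.T @ (Xb @ W - Y[idx]) / batch_size
        W -= learning_rate(t, batch_size) * grad
    return W


# Simulated data: N = 100 i.i.d. Gaussian inputs in dimension n = 400 (from the table).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 400))
Y = X @ rng.standard_normal((400, 1))  # hypothetical linear targets, one output

# Hypothetical decaying noise schedule; the paper compares several schedules.
W = augmented_sgd(X, Y, sigma_schedule=lambda t: 1.0 / np.sqrt(1 + t))
print(np.linalg.norm(W))  # Frobenius norm of the final iterate
```

Passing a different callable as `sigma_schedule` changes how the noise variance evolves over the trajectory, which is the quantity the quoted Figure 4.1 description varies.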