Beyond Implicit Bias: The Insignificance of SGD Noise in Online Learning

Authors: Nikhil Vyas, Depen Morwani, Rosie Zhao, Gal Kaplun, Sham M. Kakade, Boaz Barak

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through an extensive empirical analysis of image and language data, we demonstrate that small batch sizes do not confer any implicit bias advantages in online learning.
Researcher Affiliation | Academia | SEAS, Harvard University; Kempner Institute, Harvard University. Correspondence to: Nikhil Vyas <nikhil@g.harvard.edu>, Depen Morwani <dmorwani@g.harvard.edu>.
Pseudocode | No | The paper describes experimental procedures and theoretical derivations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | Specifically, we run ResNet-18 on CIFAR-5m (Nakkiran et al., 2021), a synthetically generated version of CIFAR-10 with 5 million examples, ConvNeXt-T on ImageNet, and GPT-2-small on C4.
Dataset Splits | No | The paper mentions training on subsets of datasets (e.g., a 'random subset of 50k samples' for CIFAR-5m offline, '128k examples' for ImageNet offline, and a 'random subset of roughly 100 million tokens' for C4 offline) and evaluates on a test set, but it does not provide the specific training/validation/test split details (e.g., percentages or exact counts for all splits) needed for reproduction.
Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments, such as GPU models, CPU types, or memory configurations.
Software Dependencies | No | The paper mentions using PyTorch, OpenCV, and specific optimizers (SGD, AdamW, Adam) but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | For our CIFAR-5m experiments, we trained ResNet-18 on normalized (across channels) images using the SGD optimizer with 0.9 momentum. For both offline and online learning, we used a learning rate of 0.025. ... For the ImageNet experiments, we used ConvNeXt-T... used a batch size of 2048 and learning rate of 1e-4 with the AdamW optimizer with weight decay 0.005. ... For all experiments we trained GPT-2-small (124M parameters) on the C4 dataset with sequence length 2048. The optimizer we use is Adam without weight decay and a constant learning rate of 6e-4.
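
As an illustration of the experiment-setup row above, here is a minimal PyTorch sketch of the three reported optimizer configurations. Only the optimizer choices and hyperparameters come from the paper; the model constructors, class counts, and the GPT-2 stand-in are assumptions, since the paper releases no code.

```python
# Sketch of the reported optimizer configurations. Hyperparameters are from the paper;
# model constructors and class counts are assumptions (no official code is released).
import torch
from torchvision.models import resnet18, convnext_tiny

# CIFAR-5m: ResNet-18 with SGD, momentum 0.9, learning rate 0.025 (offline and online).
cifar_model = resnet18(num_classes=10)
cifar_opt = torch.optim.SGD(cifar_model.parameters(), lr=0.025, momentum=0.9)

# ImageNet: ConvNeXt-T with AdamW, learning rate 1e-4, weight decay 0.005 (batch size 2048).
imagenet_model = convnext_tiny(num_classes=1000)
imagenet_opt = torch.optim.AdamW(imagenet_model.parameters(), lr=1e-4, weight_decay=0.005)

# C4: GPT-2-small (124M parameters), sequence length 2048, Adam without weight decay,
# constant learning rate 6e-4. The paper does not name a GPT-2 implementation, so a
# placeholder module stands in here purely so the optimizer call is runnable.
gpt2_stand_in = torch.nn.Linear(768, 50257)  # placeholder for a 124M-parameter GPT-2
gpt2_opt = torch.optim.Adam(gpt2_stand_in.parameters(), lr=6e-4, weight_decay=0.0)
```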