Label Noise SGD Provably Prefers Flat Global Minimizers

Authors: Alex Damian, Tengyu Ma, Jason D. Lee

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Section 4 presents experimental results which support our theory. Finally, Section 6 discusses the implications of this work." and, from Section 4 (Experiments): "In order to test the ability of SGD with label noise to escape poor global minimizers and converge to better minimizers, we initialize Algorithm 1 at global minimizers of the training loss which achieve 100% training accuracy yet generalize poorly to the test set."
Researcher Affiliation | Academia | Alex Damian, Princeton University (ad27@princeton.edu); Tengyu Ma, Stanford University (tengyuma@stanford.edu); Jason Lee, Princeton University (jasonlee@princeton.edu)
Pseudocode | Yes | "Algorithm 1: SGD with Label Noise. Input: θ₀, step size η, noise variance σ², batch size B, steps T" (a minimal code sketch of this procedure is given after the table)
Open Source Code | No | "Code will be submitted through the supplementary material and will be made available (through Github) upon acceptance."
Open Datasets | Yes | "Experiments were run with ResNet18 on CIFAR10 [17] without data augmentation or weight decay." "For CIFAR10 we cite Krizhevsky [17], as requested by the creators on https://www.cs.toronto.edu/~kriz/cifar.html." (see the data-loading sketch after the table)
Dataset Splits | No | The paper mentions 'training accuracy' and 'test accuracy' in Section 4, but it does not specify a separate validation split, its size, or how it was created.
Hardware Specification | Yes | "The experiments were performed on NVIDIA GeForce RTX 2080 Ti GPUs."
Software Dependencies | No | "The code was implemented in PyTorch [24] and PyTorch Lightning [6], and Weights and Biases [2] was used for experiment tracking."
Experiment Setup | Yes | "Experiments were run with ResNet18 on CIFAR10 [17] without data augmentation or weight decay. The experiments were conducted with randomized label flipping with probability 0.2 (see Appendix E for the extension of Theorem 1 to classification with label flipping), cross entropy loss, and batch size 256." (see the training-step sketch after the table)
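
For the Pseudocode row, the following is a minimal PyTorch sketch of the regression form of Algorithm 1: each sampled label is perturbed by additive Gaussian noise of variance σ² before a vanilla SGD step. The function and variable names (sgd_with_label_noise, model, dataset) and the squared-loss choice are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of Algorithm 1 (SGD with label noise) in its regression form,
# assuming a generic PyTorch model and a dataset of (x, y) tensor pairs.
# All names here are illustrative, not taken from the paper's implementation.
import torch
import torch.nn.functional as F


def sgd_with_label_noise(model, dataset, eta, sigma, batch_size, num_steps):
    """Run num_steps of minibatch SGD, adding Gaussian noise of variance sigma**2
    to the sampled labels before each gradient step."""
    loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
    data_iter = iter(loader)
    for _ in range(num_steps):
        try:
            x, y = next(data_iter)
        except StopIteration:          # restart the loader when an epoch ends
            data_iter = iter(loader)
            x, y = next(data_iter)
        noisy_y = y + sigma * torch.randn_like(y)   # label noise with variance sigma^2
        loss = F.mse_loss(model(x), noisy_y)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is not None:
                    p -= eta * p.grad               # vanilla SGD update: theta <- theta - eta * grad
    return model
```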
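
For the Open Datasets row, a data-loading sketch consistent with the reported setup (CIFAR10 without data augmentation) could use torchvision as below; the root path, the ToTensor-only transform, and the loader settings are assumptions for illustration.

```python
# Illustrative CIFAR10 loading without data augmentation via torchvision.
import torch
import torchvision
import torchvision.transforms as T

transform = T.ToTensor()  # no augmentation, only conversion to tensors
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256, shuffle=False)
```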
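
For the Experiment Setup row, a hedged sketch of one training step with the reported label-flipping noise (flip probability 0.2, cross-entropy loss) is given below; flip_labels and noisy_step are hypothetical helper names, and the ResNet18 model, optimizer, and 256-sample batches are assumed to come from the loaders sketched above.

```python
# Hedged sketch of one training step with label-flipping noise (flip probability 0.2)
# and cross-entropy loss. flip_labels and noisy_step are hypothetical helper names;
# model, optimizer, and (x, y) batches are assumed to exist (e.g. from the loader above).
import torch
import torch.nn.functional as F


def flip_labels(y, num_classes=10, flip_prob=0.2):
    """With probability flip_prob, replace each label with a uniformly random different class."""
    flip_mask = torch.rand_like(y, dtype=torch.float) < flip_prob
    offsets = torch.randint(1, num_classes, y.shape, device=y.device)  # offset in {1, ..., num_classes - 1}
    return torch.where(flip_mask, (y + offsets) % num_classes, y)


def noisy_step(model, optimizer, x, y):
    """One SGD step on the cross-entropy loss evaluated with flipped labels."""
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), flip_labels(y))
    loss.backward()
    optimizer.step()
    return loss.item()
```

A plain SGD optimizer with no weight decay, e.g. torch.optim.SGD(model.parameters(), lr=eta), would be passed in as optimizer, matching the reported setup.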