Label Noise SGD Provably Prefers Flat Global Minimizers
Authors: Alex Damian, Tengyu Ma, Jason D. Lee
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Section 4 presents experimental results which support our theory. Finally, Section 6 discusses the implications of this work." And, from Section 4 (Experiments): "In order to test the ability of SGD with label noise to escape poor global minimizers and converge to better minimizers, we initialize Algorithm 1 at global minimizers of the training loss which achieve 100% training accuracy yet generalize poorly to the test set." |
| Researcher Affiliation | Academia | Alex Damian, Princeton University (ad27@princeton.edu); Tengyu Ma, Stanford University (tengyuma@stanford.edu); Jason Lee, Princeton University (jasonlee@princeton.edu) |
| Pseudocode | Yes | Algorithm 1: SGD with Label Noise. Input: θ₀, step size η, noise variance σ², batch size B, steps T (see the first sketch after this table). |
| Open Source Code | No | Code will be submitted through the supplementary material and will be made available (through Github) upon acceptance. |
| Open Datasets | Yes | Experiments were run with ResNet18 on CIFAR10 [17] without data augmentation or weight decay. For CIFAR10 we cite Krizhevsky [17], as requested by the creators on https://www.cs.toronto.edu/~kriz/cifar.html. |
| Dataset Splits | No | The paper mentions 'training accuracy' and 'test accuracy' in Section 4, but it does not specify the use of a separate validation split, its size, or how it was created. |
| Hardware Specification | Yes | The experiments were performed on NVIDIA GeForce RTX 2080 Ti GPUs. |
| Software Dependencies | No | The code was implemented in PyTorch [24] and PyTorch Lightning [6], and Weights & Biases [2] was used for experiment tracking. The libraries are named, but no version numbers are specified. |
| Experiment Setup | Yes | Experiments were run with ResNet18 on CIFAR10 [17] without data augmentation or weight decay. The experiments were conducted with randomized label flipping with probability 0.2 (see Appendix E for the extension of Theorem 1 to classification with label flipping), cross entropy loss, and batch size 256 (see the second sketch after this table). |
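
For concreteness, here is a minimal PyTorch sketch of Algorithm 1 as quoted in the Pseudocode row: each step draws a fresh minibatch, perturbs the labels with Gaussian noise of variance σ², and takes a plain SGD step. The function name, the in-memory data interface, and the default hyperparameter values are illustrative assumptions, not the authors' released code; for the regression setting one would pass, e.g., `loss_fn=F.mse_loss`.

```python
import torch

def sgd_with_label_noise(model, loss_fn, data, targets,
                         eta=0.01, sigma=0.1, batch_size=256, steps=1000):
    """Algorithm 1 sketch: SGD where each minibatch's labels receive
    fresh Gaussian noise of variance sigma^2 before the gradient step."""
    opt = torch.optim.SGD(model.parameters(), lr=eta)
    n = data.shape[0]
    for _ in range(steps):
        idx = torch.randint(0, n, (batch_size,))   # sample a minibatch of size B
        x, y = data[idx], targets[idx]
        y_noisy = y + sigma * torch.randn_like(y)  # label noise: eps ~ N(0, sigma^2)
        loss = loss_fn(model(x), y_noisy)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```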
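Likewise, a sketch matching the Experiment Setup row: ResNet18 on CIFAR10 with randomized label flipping at probability 0.2, cross entropy loss, and batch size 256. Whether a flip may land on the original class is not stated in the quoted text, so the uniform choice below is an assumption, as are the learning rate and the data-loading details.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.models import resnet18

def flip_labels(y, num_classes=10, p=0.2):
    """With probability p, replace each label by a uniformly random class
    (assumption: the flip may land on the original class)."""
    flip = torch.rand(y.shape, device=y.device) < p
    rand = torch.randint(0, num_classes, y.shape, device=y.device)
    return torch.where(flip, rand, y)

# ResNet18 on CIFAR10 without data augmentation, per the setup row.
loader = DataLoader(
    datasets.CIFAR10("data", train=True, download=True,
                     transform=transforms.ToTensor()),
    batch_size=256, shuffle=True)
model = resnet18(num_classes=10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # lr value is an assumption

for x, y in loader:
    y_noisy = flip_labels(y, p=0.2)            # label flipping with probability 0.2
    loss = F.cross_entropy(model(x), y_noisy)  # cross entropy loss, as in Section 4
    opt.zero_grad()
    loss.backward()
    opt.step()
```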