Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise
Authors: Xingyu Wang, Sewoong Oh, Chang-Han Rhee
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Real data experiments on deep learning confirm our theoretical prediction that heavy-tailed SGD with gradient clipping finds "flatter" local minima and achieves better generalization. ... Section 3 presents numerical experiments that confirm our theory. Section 4 proposes a new algorithm that artificially injects heavy-tailed gradient noise in actual deep learning tasks and demonstrates the improved performance. |
| Researcher Affiliation | Academia | 1Northwestern University, 2University of Washington |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions adapting code from another work: "The experiment setting and choice of hyperparameters are mostly adapted from the experiment in Zhu et al. (2019)" (footnote 5: https://github.com/uuujf/SGDNoise). However, it does not state that the authors' own implementation of the methods described in this paper (e.g., "Our 1" or "Our 2") is open-sourced, nor does it provide a direct link to one. |
| Open Datasets | Yes | We consider three different tasks: (1) LeNet (LeCun et al., 1990) on corrupted Fashion-MNIST (Xiao et al., 2017), (2) VGG11 (Simonyan & Zisserman, 2014) on SVHN (Netzer et al., 2011), (3) VGG11 on CIFAR10 (Krizhevsky et al., 2009) (see Appendix A for details). |
| Dataset Splits | No | The paper describes using training and test datasets. For instance, "For all tasks we use the entire test dataset when evaluating test accuracy." (Appendix A.3). However, there is no explicit mention of a separate validation dataset split used for hyperparameter tuning or model selection. |
| Hardware Specification | Yes | We first mention that all experiments using neural networks are conducted on an Nvidia GeForce GTX 1080 Ti. |
| Software Dependencies | No | The paper mentions models like LeNet and VGG11/VGG16 and references PyTorch indirectly through a GitHub link in the appendix, but it does not specify concrete version numbers for any software libraries or dependencies (e.g., PyTorch version, Python version, CUDA version) used in the experiments. |
| Experiment Setup | Yes | The experiment setting and choice of hyperparameters are adapted from (Zhu et al., 2019). ... Table A.2 (Hyperparameters for training in the ablation study) lists: learning rate 0.05, batch size for gSB 100, training iterations 10,000, gradient clipping threshold 5, c = 0.5, α = 1.4 for the various tasks. ... In the first phase (the first 200 epochs), the learning rate is kept constant. In the second phase, the learning rate is halved every 30 epochs. An L2 weight decay with coefficient 5 × 10⁻⁴ is also enforced. (A hedged sketch of this setup appears below the table.) |
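
For readers trying to reproduce the quoted setup, below is a minimal sketch of one SGD update with injected heavy-tailed noise followed by gradient clipping, using the hyperparameters from the table (learning rate 0.05, clipping threshold 5, c = 0.5, α = 1.4, weight decay 5 × 10⁻⁴). It assumes PyTorch; the α-stable sampler, the use of global norm clipping, and the exact order of noise injection and truncation are illustrative assumptions, not the authors' released implementation.

```python
import math
import torch

# Hyperparameters quoted in the table above; everything else in this sketch
# (sampler, clipping style, update order) is an illustrative assumption.
LR, CLIP_THRESHOLD, NOISE_SCALE_C, TAIL_ALPHA, WEIGHT_DECAY = 0.05, 5.0, 0.5, 1.4, 5e-4

def sample_sas(alpha, shape, device):
    """Standard symmetric alpha-stable noise via the Chambers-Mallows-Stuck method (alpha != 1)."""
    u = (torch.rand(shape, device=device) - 0.5) * math.pi   # Uniform(-pi/2, pi/2)
    w = -torch.log(torch.rand(shape, device=device))         # Exp(1)
    return (torch.sin(alpha * u) / torch.cos(u).pow(1.0 / alpha)
            * (torch.cos((1.0 - alpha) * u) / w).pow((1.0 - alpha) / alpha))

def noisy_truncated_sgd_step(model, loss_fn, x, y):
    """One SGD step: inject heavy-tailed noise into the gradient, then clip (truncate) and update."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.grad += NOISE_SCALE_C * sample_sas(TAIL_ALPHA, p.grad.shape, p.grad.device)
        # Truncation via global gradient-norm clipping at the quoted threshold of 5;
        # the paper's exact truncation operator may differ.
        torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_THRESHOLD)
        for p in model.parameters():
            if p.grad is not None:
                p -= LR * (p.grad + WEIGHT_DECAY * p)  # SGD update with L2 weight decay
```

The sampler draws standard symmetric α-stable variates (heavy-tailed for α < 2), so with α = 1.4 the injected noise occasionally produces very large gradient perturbations that the clipping step then truncates.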
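
The two-phase learning-rate schedule described in the setup (constant for the first 200 epochs, then halved every 30 epochs) could be wired up with a standard PyTorch scheduler as sketched below; the tiny stand-in model, the LambdaLR choice, and the epoch at which the first halving occurs are assumptions.

```python
import torch

# Minimal wiring for the quoted two-phase schedule: constant learning rate for the
# first 200 epochs, then halved every 30 epochs afterward.
model = torch.nn.Linear(10, 2)                       # stand-in for LeNet / VGG11
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, weight_decay=5e-4)

def two_phase_factor(epoch: int) -> float:
    if epoch < 200:                                  # phase 1: constant
        return 1.0
    return 0.5 ** ((epoch - 200) // 30 + 1)          # phase 2: halve every 30 epochs

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=two_phase_factor)

for epoch in range(260):
    # ... one training epoch would go here ...
    scheduler.step()
    print(epoch, scheduler.get_last_lr())            # inspect the effective learning rate
```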