Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise
Authors: Xingyu Wang, Sewoong Oh, Chang-Han Rhee
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Real data experiments on deep learning confirm our theoretical prediction that heavy-tailed SGD with gradient clipping finds "flatter" local minima and achieves better generalization. ... Section 3 presents numerical experiments that confirm our theory. Section 4 proposes a new algorithm that artificially injects heavy-tailed gradient noise in actual deep learning tasks and demonstrates the improved performance. |
| Researcher Affiliation | Academia | 1Northwestern University, 2University of Washington |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions adapting code from another work: "The experiment setting and choice of hyperparameters are mostly adapted from the experiment in Zhu et al. (2019)" (footnote 5: https://github.com/uuujf/SGDNoise). However, it does not state that the authors' own implementation of the methods described in this paper (e.g., "Our 1" or "Our 2") is open-sourced, nor does it provide a direct link to one. |
| Open Datasets | Yes | We consider three different tasks: (1) LeNet (LeCun et al., 1990) on corrupted Fashion-MNIST (Xiao et al., 2017), (2) VGG11 (Simonyan & Zisserman, 2014) on SVHN (Netzer et al., 2011), (3) VGG11 on CIFAR10 (Krizhevsky et al., 2009) (see Appendix A for details). |
| Dataset Splits | No | The paper describes using training and test datasets. For instance, "For all tasks we use the entire test dataset when evaluating test accuracy." (Appendix A.3). However, there is no explicit mention of a separate validation dataset split used for hyperparameter tuning or model selection. |
| Hardware Specification | Yes | We first mention that all experiments using neural networks are conducted on an Nvidia GeForce GTX 1080 Ti. |
| Software Dependencies | No | The paper mentions models like LeNet and VGG11/VGG16 and references PyTorch indirectly through a GitHub link in the appendix, but it does not specify concrete version numbers for any software libraries or dependencies (e.g., PyTorch version, Python version, CUDA version) used in the experiments. |
| Experiment Setup | Yes | The experiment setting and choice of hyperparameters are adapted from (Zhu et al., 2019). ... Table A.2 (Hyperparameters for training in the ablation study) lists: learning rate 0.05, batch size for gSB 100, training iterations 10,000, gradient clipping threshold 5, c = 0.5, α = 1.4 for the various tasks. ... In the first phase (the first 200 epochs), the learning rate is kept constant. In the second phase, the learning rate is halved every 30 epochs. An L2 weight decay with coefficient 5 × 10⁻⁴ is also enforced. (A hedged sketch of this setup appears below the table.) |
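
For readers trying to reproduce the quoted setup, below is a minimal sketch of one SGD update with injected heavy-tailed noise followed by gradient clipping, using the hyperparameters from the table (learning rate 0.05, clipping threshold 5, c = 0.5, α = 1.4, weight decay 5 × 10⁻⁴). It assumes PyTorch; the α-stable sampler, the use of global norm clipping, and the exact order of noise injection and truncation are illustrative assumptions, not the authors' released implementation.

```python
import math
import torch

# Hyperparameters quoted in the table above; everything else in this sketch
# (sampler, clipping style, update order) is an illustrative assumption.
LR, CLIP_THRESHOLD, NOISE_SCALE_C, TAIL_ALPHA, WEIGHT_DECAY = 0.05, 5.0, 0.5, 1.4, 5e-4

def sample_sas(alpha, shape, device):
    """Standard symmetric alpha-stable noise via the Chambers-Mallows-Stuck method (alpha != 1)."""
    u = (torch.rand(shape, device=device) - 0.5) * math.pi   # Uniform(-pi/2, pi/2)
    w = -torch.log(torch.rand(shape, device=device))         # Exp(1)
    return (torch.sin(alpha * u) / torch.cos(u).pow(1.0 / alpha)
            * (torch.cos((1.0 - alpha) * u) / w).pow((1.0 - alpha) / alpha))

def noisy_truncated_sgd_step(model, loss_fn, x, y):
    """One SGD step: inject heavy-tailed noise into the gradient, then clip (truncate) and update."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.grad += NOISE_SCALE_C * sample_sas(TAIL_ALPHA, p.grad.shape, p.grad.device)
        # Truncation via global gradient-norm clipping at the quoted threshold of 5;
        # the paper's exact truncation operator may differ.
        torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_THRESHOLD)
        for p in model.parameters():
            if p.grad is not None:
                p -= LR * (p.grad + WEIGHT_DECAY * p)  # SGD update with L2 weight decay
```

The sampler draws standard symmetric α-stable variates (heavy-tailed for α < 2), so with α = 1.4 the injected noise occasionally produces very large gradient perturbations that the clipping step then truncates.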
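
The two-phase learning-rate schedule described in the setup (constant for the first 200 epochs, then halved every 30 epochs) could be wired up with a standard PyTorch scheduler as sketched below; the tiny stand-in model, the LambdaLR choice, and the epoch at which the first halving occurs are assumptions.

```python
import torch

# Minimal wiring for the quoted two-phase schedule: constant learning rate for the
# first 200 epochs, then halved every 30 epochs afterward.
model = torch.nn.Linear(10, 2)                       # stand-in for LeNet / VGG11
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, weight_decay=5e-4)

def two_phase_factor(epoch: int) -> float:
    if epoch < 200:                                  # phase 1: constant
        return 1.0
    return 0.5 ** ((epoch - 200) // 30 + 1)          # phase 2: halve every 30 epochs

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=two_phase_factor)

for epoch in range(260):
    # ... one training epoch would go here ...
    scheduler.step()
    print(epoch, scheduler.get_last_lr())            # inspect the effective learning rate
```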