On the Overlooked Structure of Stochastic Gradients

Authors: Zeke Xie, Qian-Yuan Tang, Mingming Sun, Ping Li

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The paper states: "In this paper, we mainly make two contributions. First, we conduct formal statistical tests on the distribution of stochastic gradients and gradient noise across both parameters and iterations. Our statistical tests reveal that dimension-wise gradients usually exhibit power-law heavy tails, while iteration-wise gradients and stochastic gradient noise caused by minibatch training usually do not exhibit power-law heavy tails." It also reports: "The experiments are conducted on a computing cluster with NVIDIA V100 GPUs and Intel Xeon CPUs." (A hedged sketch of such a power-law tail test is given below the table.)
Researcher Affiliation | Collaboration | Zeke Xie (1), Qian-Yuan Tang (2), Mingming Sun (1), and Ping Li (1). (1) Cognitive Computing Lab, Baidu Research; (2) Department of Physics, Hong Kong Baptist University. Correspondence: xiezeke@baidu.com, tangqy@hkbu.edu.hk
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper states "We can produce/reproduce the main experiments using PaddlePaddle (Ma et al., 2019) and PyTorch (Paszke et al., 2019)", which refers to the frameworks used rather than to the authors' own open-source code. No explicit statement of, or link to, a code repository for their methodology was found.
Open Datasets | Yes | Datasets: MNIST (LeCun, 1998), CIFAR-10/100 (Krizhevsky and Hinton, 2009), and Avila (De Stefano et al., 2018).
Dataset Splits | No | The paper mentions datasets such as MNIST and CIFAR-10/100 and sets hyperparameters such as batch size and number of epochs, but it does not give training/validation/test split percentages, absolute sample counts, or citations to predefined splits that would allow the data partitioning to be reproduced.
Hardware Specification | Yes | "The experiments are conducted on a computing cluster with NVIDIA V100 GPUs and Intel Xeon CPUs."
Software Dependencies | No | The paper states "We can produce/reproduce the main experiments using PaddlePaddle (Ma et al., 2019) and PyTorch (Paszke et al., 2019)", but it does not give version numbers for these libraries or for any other key software components.
Experiment Setup | Yes | Pretraining hyperparameter settings: neural networks are trained for 50 epochs on MNIST to obtain pretrained models. The learning rate is divided by 10 at 40% and 80% of training. η = 0.1 is used for SGD/Momentum and η = 0.001 for Adam. The batch size is 128. The weight decay strength defaults to λ = 0.0005 for pretrained models. The momentum hyperparameter is β1 = 0.9 for SGD Momentum. All other optimizer hyperparameters use their default settings. (A hedged PyTorch sketch of this setup follows the table.)
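
To make the first row concrete, here is a minimal, hedged sketch of how one might test whether dimension-wise stochastic gradient magnitudes exhibit a power-law heavy tail. This is not the authors' code: the toy model, the random stand-in mini-batch, and the use of the third-party `powerlaw` package (for tail fitting and a likelihood-ratio comparison against a lognormal alternative) are all illustrative assumptions.

```python
# Sketch only: test one stochastic gradient's dimension-wise magnitudes for a
# power-law heavy tail. Toy model and random data stand in for MNIST training.
import torch
import torch.nn as nn
import powerlaw  # pip install powerlaw (Alstott et al.); an assumed tool, not the authors' choice

# Placeholder two-layer network and a random mini-batch of batch size 128.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(), nn.Linear(256, 10))
x = torch.randn(128, 1, 28, 28)
y = torch.randint(0, 10, (128,))

# One stochastic gradient over the mini-batch.
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()

# Dimension-wise view: all gradient coordinates at a single iteration.
grads = torch.cat([p.grad.flatten() for p in model.parameters()])
magnitudes = grads.abs().numpy()
magnitudes = magnitudes[magnitudes > 0]  # power-law fits need positive support

# Fit a power law to the tail and compare against a lognormal alternative.
fit = powerlaw.Fit(magnitudes)
R, p = fit.distribution_compare("power_law", "lognormal")
print(f"tail exponent alpha = {fit.power_law.alpha:.2f}, xmin = {fit.power_law.xmin:.3g}")
print(f"log-likelihood ratio vs lognormal: R = {R:.2f} (R > 0 favors power law), p = {p:.3f}")
```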
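
The following is a hedged PyTorch sketch of the pretraining setup described in the last row (the paper mentions both PaddlePaddle and PyTorch). The two-layer network and the data pipeline are placeholder assumptions; only the hyperparameters (50 epochs on MNIST, batch size 128, the learning rates, weight decay, momentum, and the learning-rate drops at 40% and 80% of training) come from the reported settings.

```python
# Sketch of the reported pretraining hyperparameters; not the authors' implementation.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

EPOCHS, BATCH_SIZE, WEIGHT_DECAY = 50, 128, 5e-4

train_loader = DataLoader(
    datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor()),
    batch_size=BATCH_SIZE, shuffle=True,
)

# Placeholder model; the paper's architectures are not reproduced here.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(), nn.Linear(256, 10))

# eta = 0.1 for SGD/Momentum with beta1 = 0.9 and weight decay lambda = 0.0005.
# For the Adam runs, one would presumably swap in optim.Adam(model.parameters(), lr=1e-3, ...).
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=WEIGHT_DECAY)

# Learning rate divided by 10 at 40% and 80% of training (epochs 20 and 40 of 50).
scheduler = optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[int(0.4 * EPOCHS), int(0.8 * EPOCHS)], gamma=0.1
)

criterion = nn.CrossEntropyLoss()
for epoch in range(EPOCHS):
    for x, y in train_loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    scheduler.step()
```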