The Heavy-Tail Phenomenon in SGD

Authors: Mert Gürbüzbalaban, Umut Şimşekli, Lingjiong Zhu

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we present our experimental results on both synthetic and real data, in order to illustrate that our theory also holds in finite-sum problems (besides the streaming setting). Our main goal will be to illustrate the tail behavior of SGD by varying the algorithm parameters: depending on the choice of the stepsize η and the batch-size b, the distribution of the iterates does converge to a heavy-tailed distribution (Theorem 2) and the behavior of the tail-index obeys Theorem 4. Our implementations can be found in github.com/umutsimsekli/sgd_ht."
Researcher Affiliation | Academia | "1 Department of Management Science and Information Systems, Rutgers Business School, Piscataway, USA; 2 INRIA, Département d'Informatique de l'École Normale Supérieure, PSL Research University, Paris, France; 3 Department of Mathematics, Florida State University, Tallahassee, USA."
Pseudocode | No | The paper describes algorithms such as the SGD recursion (1.3) but does not present them in a structured pseudocode block or a clearly labeled "Algorithm" section.
Open Source Code | Yes | "Our implementations can be found in github.com/umutsimsekli/sgd_ht."
Open Datasets | Yes | "We train the models by using SGD ... on the MNIST and CIFAR10 datasets."
Dataset Splits | No | The paper uses the well-known MNIST and CIFAR10 datasets but does not give specific percentages or counts for training, validation, or test splits. The values "K = 1000 and K0 = 500" concern how iterates are collected and averaged in the synthetic experiments, not dataset splitting.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper points to implementations on GitHub, implying software usage, but it does not name any software with version numbers (e.g., "Python 3.8", "PyTorch 1.9", "CUDA 11.1").
Experiment Setup | Yes | "We set d = 100, first fix the variances σ = 1, σ_x = σ_y = 3, and generate {(a_i, y_i)}_{i=1}^n by simulating the statistical model. Then, by fixing this dataset, we run the SGD recursion (3.5) for a large number of iterations and vary η from 0.02 to 0.2 and b from 1 to 20. We also set K = 1000 and K0 = 500." ... "We train the models by using SGD for 10K iterations and we range η from 10^-4 to 10^-1 and b from 1 to 10." ... "where we vary η from 10^-4 to 1.7 × 10^-3 and b from 1 to 10." (A hedged code sketch of the synthetic setup follows the table.)
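
The synthetic setup quoted above is compact enough to restate as code. The sketch below is a hypothetical reconstruction, not the authors' implementation: it simulates a Gaussian linear-regression model with the stated variances (σ = 1, σ_x = σ_y = 3, d = 100) and runs mini-batch SGD on the resulting finite-sum least-squares objective. The sample size n, the ground-truth parameter x_star, and the exact form of the paper's recursion (3.5) are assumptions not fixed by the excerpt.

```python
# Hypothetical reconstruction of the synthetic experiment; variable names
# (d, sigma, sigma_x, sigma_y, eta, b) mirror the paper's symbols, while
# n and x_star are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

d, n = 100, 1000                         # d is from the paper; n is assumed
sigma, sigma_x, sigma_y = 1.0, 3.0, 3.0  # variances quoted in the setup

# Simulate the statistical model: a_i ~ N(0, sigma_x^2 I_d),
# y_i = a_i^T x_* + Gaussian noise.
x_star = rng.normal(0.0, sigma, size=d)
A = rng.normal(0.0, sigma_x, size=(n, d))
y = A @ x_star + rng.normal(0.0, sigma_y, size=n)

def run_sgd(eta, b, iters=10_000):
    """Mini-batch SGD on f(x) = (1/2n) * ||A x - y||^2 over the fixed dataset."""
    x = np.zeros(d)
    for _ in range(iters):
        idx = rng.choice(n, size=b, replace=False)   # draw a mini-batch
        grad = A[idx].T @ (A[idx] @ x - y[idx]) / b  # stochastic gradient
        x = x - eta * grad                           # SGD step
    return x

x_final = run_sgd(eta=0.1, b=5)
```

Sweeping η over [0.02, 0.2] and b over {1, ..., 20}, as in the quoted setup, amounts to calling run_sgd on the grid of (η, b) pairs.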
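Reproducing the tail-index results additionally requires an estimator of the index α from the iterates; this is where K = 1000 and K0 = 500 enter in the paper. The snippet below substitutes a plain Hill estimator applied to the norms of independently restarted runs: a simple stand-in for illustration, not the multivariate α-stable estimator the authors use. It reuses run_sgd from the sketch above.

```python
# Hill-type tail-index estimation: a stand-in probe, not the paper's estimator.
def hill_estimator(samples, k):
    """Estimate the tail index from the k largest order statistics of |samples|."""
    x = np.sort(np.abs(samples))[::-1]         # descending order statistics
    log_excess = np.log(x[:k]) - np.log(x[k])  # log-spacings above the threshold
    return 1.0 / np.mean(log_excess)           # alpha_hat = 1 / mean log-excess

# Norm of the final iterate across independent restarts of the same recursion.
norms = np.array([np.linalg.norm(run_sgd(eta=0.15, b=1, iters=2_000))
                  for _ in range(200)])
alpha_hat = hill_estimator(norms, k=20)
print(f"estimated tail index: {alpha_hat:.2f}")  # smaller alpha => heavier tails
```

Qualitatively, the theory quoted in the Research Type row predicts heavier tails (smaller estimated α) as the stepsize η grows or the batch size b shrinks, which a sweep of this kind can visualize.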