The Heavy-Tail Phenomenon in SGD
Authors: Mert Gurbuzbalaban, Umut Simsekli, Lingjiong Zhu
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present our experimental results on both synthetic and real data, in order to illustrate that our theory also holds in finite-sum problems (besides the streaming setting). Our main goal will be to illustrate the tail behavior of SGD by varying the algorithm parameters: depending on the choice of the stepsize η and the batch-size b, the distribution of the iterates does converge to a heavy-tailed distribution (Theorem 2) and the behavior of the tail-index obeys Theorem 4. Our implementations can be found in github.com/umutsimsekli/sgd_ht. |
| Researcher Affiliation | Academia | ¹Department of Management Science and Information Systems, Rutgers Business School, Piscataway, USA; ²INRIA, Département d'Informatique de l'École Normale Supérieure, PSL Research University, Paris, France; ³Department of Mathematics, Florida State University, Tallahassee, USA. |
| Pseudocode | No | The paper describes algorithms such as the SGD recursion (1.3), but does not present them in a structured pseudocode block or a clearly labeled 'Algorithm' section. |
| Open Source Code | Yes | Our implementations can be found in github.com/umutsimsekli/sgd_ht. |
| Open Datasets | Yes | We train the models by using SGD ... on the MNIST and CIFAR10 datasets. |
| Dataset Splits | No | The paper mentions using well-known datasets like MNIST and CIFAR10, but it does not provide specific percentages or counts for training, validation, or test splits. It mentions 'K = 1000 and K0 = 500' for averaging iterates in synthetic experiments, which refers to iteration setup, not dataset splitting. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions 'implementations' are available on GitHub, implying software usage, but it does not specify any software names with version numbers (e.g., 'Python 3.8', 'PyTorch 1.9', 'CUDA 11.1'). |
| Experiment Setup | Yes | We set d = 100, first fix the variances σ = 1, σ_x = σ_y = 3, and generate {(a_i, y_i)}_{i=1}^{n} by simulating the statistical model. Then, by fixing this dataset, we run the SGD recursion (3.5) for a large number of iterations and vary η from 0.02 to 0.2 and b from 1 to 20. We also set K = 1000 and K_0 = 500. ... We train the models by using SGD for 10K iterations and we range η from 10^-4 to 10^-1 and b from 1 to 10. ... where we vary η from 10^-4 to 1.7 × 10^-3 and b from 1 to 10. (Minimal sketches of the synthetic setup and of tail-index estimation appear after the table.) |
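
To make the quoted synthetic setup concrete, the sketch below runs mini-batch SGD on a Gaussian linear-regression problem over a grid of step-sizes η and batch-sizes b, mirroring the d = 100, σ = 1, σ_x = σ_y = 3 configuration. The quadratic loss, the 1/√d feature scaling (chosen here so the quoted η grid stays in a stable regime), and all variable names are illustrative assumptions, not taken from the authors' repository (github.com/umutsimsekli/sgd_ht).

```python
import numpy as np

# Minimal sketch of the synthetic experiment described in the quoted setup.
# Assumptions: quadratic loss f_i(x) = (a_i^T x - y_i)^2 / 2 and features
# scaled by 1/sqrt(d); see the paper and its repo for the exact model.
rng = np.random.default_rng(0)

d, n = 100, 1000                        # dimension, number of data points
sigma, sigma_x, sigma_y = 1.0, 3.0, 3.0

# Simulate the statistical model: y_i = a_i^T x_* + noise.
a = rng.normal(0.0, sigma_x / np.sqrt(d), size=(n, d))
x_star = rng.normal(0.0, sigma, size=d)
y = a @ x_star + rng.normal(0.0, sigma_y, size=n)

def run_sgd(eta, b, iters=10_000):
    """Mini-batch SGD on the least-squares objective; returns the final iterate."""
    x = np.zeros(d)
    for _ in range(iters):
        idx = rng.choice(n, size=b, replace=False)   # sample a mini-batch
        grad = a[idx].T @ (a[idx] @ x - y[idx]) / b  # mini-batch gradient
        x -= eta * grad
    return x

# Sweep the (eta, b) grid quoted above: eta in [0.02, 0.2], b in [1, 20].
for eta in (0.02, 0.1, 0.2):
    for b in (1, 5, 20):
        x_final = run_sgd(eta, b)
        print(f"eta={eta:.2f}  b={b:2d}  ||x - x*|| = {np.linalg.norm(x_final - x_star):.3e}")
```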
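
The tail behavior itself is summarized by the tail-index α (Theorems 2 and 4 in the quoted excerpts). The paper uses a dedicated tail-index estimator; as a simpler, generic stand-in, the textbook Hill estimator below illustrates how α can be read off from the largest order statistics of samples such as final SGD iterates collected over many runs. This is a substitute shown for illustration, not the authors' estimator.

```python
import numpy as np

def hill_estimator(samples, k):
    """Hill estimator of the tail index alpha from the k largest order
    statistics of |samples|. Generic textbook estimator, shown for
    illustration only (not necessarily the estimator used in the paper)."""
    x = np.sort(np.abs(np.asarray(samples)))[::-1]  # descending order
    # alpha_hat = k / sum_{i<=k} log(x_(i) / x_(k+1))
    return k / np.sum(np.log(x[:k] / x[k]))

# Sanity check on Pareto samples with true tail index 1.5.
rng = np.random.default_rng(1)
pareto = rng.pareto(1.5, size=100_000) + 1.0
print(hill_estimator(pareto, k=1_000))  # should be close to 1.5
```

Applied to the final iterates collected over the (η, b) grid above, such an estimator exhibits the qualitative trend the paper establishes: the estimated tail-index decreases (i.e., the tails become heavier) as the step-size grows and the batch-size shrinks, consistent with Theorem 4.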