Multiplicative Noise and Heavy Tails in Stochastic Optimization
Authors: Liam Hodgkinson, Michael Mahoney
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Theoretical results are obtained characterizing this for a large class of (non-linear and even non-convex) models and optimizers (including momentum, Adam, and stochastic Newton), demonstrating that this phenomenon holds generally. Furthermore, we empirically illustrate how multiplicative noise and heavy-tailed structure improve capacity for basin hopping and exploration of non-convex loss surfaces, over commonly considered stochastic dynamics with only additive noise and light-tailed structure. Numerical experiments are conducted in Section 5, illustrating how multiplicative noise and heavy-tailed stationary behaviour improve the capacity for basin hopping (relative to light-tailed stationary behaviour) in the exploratory phase of learning. |
| Researcher Affiliation | Academia | ICSI and Department of Statistics, University of California, Berkeley, USA. |
| Pseudocode | No | The paper describes algorithms and formulations using mathematical equations and textual descriptions, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any specific links to source code repositories or explicitly state that the code for the described methodology is publicly available. |
| Open Datasets | Yes | To see this, we consider fitting a two-layer neural network with 16 hidden units for classification of the Musk data set (Dietterich et al., 1997) (168 attributes; 6598 instances) with cross-entropy loss without regularization and step size γ = 10⁻². We plot histograms of four common wide ResNet architectures trained on CIFAR10 in Figure 3, and provide maximum likelihood estimates of the tail exponents. |
| Dataset Splits | No | The paper mentions using the Musk dataset and CIFAR10 but does not specify the train/validation/test splits (e.g., percentages or exact counts) used for the experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU/CPU models or memory specifications. |
| Software Dependencies | No | The paper mentions 'powerlaw: a Python package' in its references, but it does not specify any software dependencies with version numbers used for running its own experiments. |
| Experiment Setup | Yes | For fixed step size γ = 10⁻² and initial w₀ = 4.75, the distribution of 10⁶ successive iterates are presented in Figure 1 for small (σ = 2), moderate (σ = 12), and strong (σ = 50) noise. ... with cross-entropy loss without regularization and step size γ = 10⁻². Two stochastic optimizers are compared: (a) SGD with a single sample per batch (without replacement), and (b) perturbed GD (Jin et al., 2017), where the state-independent covariance of iterations in (b) is chosen to approximate that of (a) on average. |
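
The Experiment Setup row quotes a one-dimensional example (step size γ = 10⁻², initial w₀ = 4.75, 10⁶ iterates, noise levels σ ∈ {2, 12, 50}) but not the loss function or the exact way σ enters the dynamics. The sketch below is a minimal, illustrative stand-in, assuming a quadratic loss f(w) = w²/2 and Gaussian noise that perturbs the gradient both multiplicatively (state-dependently) and additively; the Hill estimate of the tail exponent is a simple proxy for the maximum-likelihood tail-exponent estimates mentioned in the Open Datasets row.

```python
import numpy as np

rng = np.random.default_rng(0)

gamma = 1e-2     # step size gamma = 10^-2, as quoted
w0 = 4.75        # initial iterate, as quoted
n_iters = 10**6  # number of successive iterates, as quoted (a few seconds per chain)

def run_chain(sigma):
    """Iterate w_{k+1} = w_k - gamma * (grad(w_k) + sigma * xi_k * w_k + eta_k),
    i.e. gradient descent on the assumed quadratic f(w) = w^2 / 2 with
    multiplicative (state-dependent) and additive Gaussian noise."""
    w = w0
    out = np.empty(n_iters)
    for k in range(n_iters):
        xi, eta = rng.standard_normal(2)
        w = w - gamma * (w + sigma * xi * w + eta)
        out[k] = w
    return out

def hill_tail_index(samples, k=1000):
    """Hill estimator of the tail exponent from the k largest |samples|."""
    x = np.sort(np.abs(samples))
    return k / np.sum(np.log(x[-k:] / x[-k - 1]))

for sigma in (2, 12, 50):
    ws = run_chain(sigma)
    print(f"sigma = {sigma:>2}: max |w| = {np.abs(ws).max():.3g}, "
          f"Hill tail index ~ {hill_tail_index(ws):.2f}")
```

In this toy recursion, stronger multiplicative noise makes the random contraction factor 1 - γ(1 + σξ_k) exceed 1 in magnitude more often, which is the mechanism behind the heavy-tailed stationary behaviour the paper analyzes.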
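
Similarly, the comparison of (a) single-sample SGD with (b) perturbed GD on the two-layer, 16-hidden-unit Musk network is only described at a high level in the quoted text. Below is a rough PyTorch sketch under stated assumptions: placeholder tensors stand in for the Musk data (which would have to be loaded separately), binary labels are assumed, the perturbed-GD iteration count is arbitrary, and a fixed noise scale replaces the paper's covariance-matched perturbation, which would require estimating the average covariance of the SGD gradient noise in (a).

```python
import torch
import torch.nn as nn

# Placeholder data: the real experiment uses the UCI Musk data set
# (168 attributes; 6598 instances); binary labels are assumed here.
n, d, hidden, n_classes = 6598, 168, 16, 2
X, y = torch.randn(n, d), torch.randint(0, n_classes, (n,))

def make_model():
    return nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, n_classes))

loss_fn = nn.CrossEntropyLoss()
gamma = 1e-2  # step size, as quoted

# (a) SGD with a single sample per batch, drawn without replacement (one pass).
model_a = make_model()
opt_a = torch.optim.SGD(model_a.parameters(), lr=gamma)
for i in torch.randperm(n).tolist():
    opt_a.zero_grad()
    loss_fn(model_a(X[i : i + 1]), y[i : i + 1]).backward()
    opt_a.step()

# (b) Perturbed GD: full-batch gradient plus state-independent Gaussian noise.
# noise_scale is an arbitrary stand-in; the paper instead chooses the noise
# covariance to approximate the average covariance of the SGD noise in (a).
model_b = make_model()
opt_b = torch.optim.SGD(model_b.parameters(), lr=gamma)
noise_scale = 1e-2
n_pgd_steps = 1000  # iteration count not specified in the quoted text
for _ in range(n_pgd_steps):
    opt_b.zero_grad()
    loss_fn(model_b(X), y).backward()
    with torch.no_grad():
        for p in model_b.parameters():
            p.grad.add_(noise_scale * torch.randn_like(p.grad))
    opt_b.step()
```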