Multiplicative Noise and Heavy Tails in Stochastic Optimization

Authors: Liam Hodgkinson, Michael Mahoney

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Theoretical results are obtained characterizing this for a large class of (non-linear and even non-convex) models and optimizers (including momentum, Adam, and stochastic Newton), demonstrating that this phenomenon holds generally. Furthermore, we empirically illustrate how multiplicative noise and heavy-tailed structure improve capacity for basin hopping and exploration of non-convex loss surfaces, over commonly considered stochastic dynamics with only additive noise and light-tailed structure. Numerical experiments are conducted in Section 5, illustrating how multiplicative noise and heavy-tailed stationary behaviour improve the capacity for basin hopping (relative to light-tailed stationary behaviour) in the exploratory phase of learning.
Researcher Affiliation | Academia | ICSI and Department of Statistics, University of California, Berkeley, USA.
Pseudocode | No | The paper describes algorithms and formulations using mathematical equations and textual descriptions, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any specific links to source code repositories or explicitly state that the code for the described methodology is publicly available.
Open Datasets | Yes | To see this, we consider fitting a two-layer neural network with 16 hidden units for classification of the Musk data set (Dietterich et al., 1997) (168 attributes; 6598 instances) with cross-entropy loss without regularization and step size γ = 10^-2. We plot histograms of four common wide ResNet architectures trained on CIFAR10 in Figure 3, and provide maximum likelihood estimates of the tail exponents.
Dataset Splits | No | The paper mentions using the Musk dataset and CIFAR10 but does not specify the train/validation/test splits (e.g., percentages or exact counts) used for the experiments.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU/CPU models or memory specifications.
Software Dependencies | No | The paper mentions 'powerlaw: a Python package' in its references, but it does not specify any software dependencies with version numbers used for running its own experiments.
Experiment Setup | Yes | For fixed step size γ = 10^-2 and initial w0 = 4.75, the distribution of 10^6 successive iterates is presented in Figure 1 for small (σ = 2), moderate (σ = 12), and strong (σ = 50) noise. ... with cross-entropy loss without regularization and step size γ = 10^-2. Two stochastic optimizers are compared: (a) SGD with a single sample per batch (without replacement), and (b) perturbed GD (Jin et al., 2017), where the state-independent covariance of iterations in (b) is chosen to approximate that of (a) on average.
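The Experiment Setup row above quotes a one-dimensional illustration (γ = 10^-2, w0 = 4.75, 10^6 iterates, σ in {2, 12, 50}) and a contrast between single-sample SGD (state-dependent, multiplicative gradient noise) and perturbed GD (state-independent, additive noise). The paper's exact objective for Figure 1 is not reproduced here, so the following is a minimal sketch under an assumed quadratic loss f(w) = w^2/2; the function names and the choice of loss are illustrative assumptions, not the authors' code, but the hyperparameters follow the quoted setup.

```python
# Minimal sketch, not the paper's code: assume a quadratic loss f(w) = w^2 / 2 whose
# stochastic gradient carries relative (multiplicative) noise plus a small additive
# term, and compare it with a purely additive-noise ("perturbed GD"-style) variant.
# Hyperparameters follow the quoted setup: gamma = 1e-2, w0 = 4.75, 1e6 iterates,
# sigma in {2, 12, 50}.
import numpy as np


def run_iterates(sigma, multiplicative=True, gamma=1e-2, w0=4.75,
                 n_iters=1_000_000, seed=0):
    """Simulate w_{k+1} = w_k - gamma * g_k.

    multiplicative=True : g_k = (1 + sigma * eps_k) * w_k + delta_k
                          (noise scales with the iterate, as in single-sample SGD)
    multiplicative=False: g_k = w_k + sigma * eps_k
                          (state-independent noise, as in perturbed GD)
    """
    rng = np.random.default_rng(seed)
    w = w0
    out = np.empty(n_iters)
    for k in range(n_iters):
        eps = rng.standard_normal()
        if multiplicative:
            grad = (1.0 + sigma * eps) * w + rng.standard_normal()
        else:
            grad = w + sigma * eps
        w -= gamma * grad
        out[k] = w
    return out


if __name__ == "__main__":
    for sigma in (2, 12, 50):
        mult = run_iterates(sigma, multiplicative=True)
        addv = run_iterates(sigma, multiplicative=False)
        # Heavy tails show up as a large gap between bulk and extreme quantiles.
        for name, ws in (("multiplicative", mult), ("additive", addv)):
            q50, q999 = np.quantile(np.abs(ws), [0.5, 0.999])
            print(f"sigma={sigma:>3} {name:>14}: median |w|={q50:.3g}, "
                  f"99.9% quantile={q999:.3g}")
```

In this toy recursion the multiplicative runs give a ratio between the extreme and median quantiles that grows sharply with σ, the heavy-tailed stationary behaviour the row above associates with basin hopping, while the additive runs stay close to Gaussian for every σ.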