Using Statistics to Automate Stochastic Optimization

Authors: Hunter Lang, Lin Xiao, Pengchuan Zhang

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on several deep learning tasks demonstrate that this statistical adaptive stochastic approximation (SASA) method can automatically find good learning rate schedules and match the performance of hand-tuned methods using default settings of its parameters.
Researcher Affiliation | Industry | Hunter Lang, Pengchuan Zhang, Lin Xiao; Microsoft Research AI, Redmond, WA 98052, USA; {hunter.lang, penzhan, lin.xiao}@microsoft.com
Pseudocode | Yes | Algorithm 1: General SASA method; Algorithm 2: SASA; Algorithm 3: Test
Open Source Code | No | The paper cites 'Pytorch word language model. https://github.com/pytorch/examples/tree/master/word_language_model, 2019.', which is a third-party example, but it does not provide access to the source code for the SASA methodology described in the paper.
Open Datasets | Yes | We trained an 18-layer ResNet model (He et al., 2016) on CIFAR-10 (Krizhevsky and Hinton, 2009); ImageNet (Deng et al., 2009); we train the PyTorch word-level language model example (2019) on the Wikitext-2 dataset (Merity et al., 2016).
Dataset Splits | Yes | We compare against SGM and Adam with (global) learning rate tuned using a validation set. These baselines drop the learning rate by a factor of 4 when the validation loss stops improving.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions the 'PyTorch word-level language model example (2019)', which implies PyTorch, but no specific version number for PyTorch or other libraries is provided.
Experiment Setup | Yes | For all experiments, we use default values δ = 0.02 and γ = 0.2. In each experiment, we use the same α₀ and β as for the best SGM baseline. We use weight decay in every experiment... SGM-hand uses α₀ = 1.0 and β = 0.9 and drops α by a factor of 10 (ζ = 0.1) every 50 epochs. SASA uses γ = 0.2 and δ = 0.02, as always. Adam has a tuned global learning rate α₀ = 0.0001 and a tuned warmup phase of 50 epochs... We trained an 18-layer ResNet model... with random cropping and random horizontal flipping for data augmentation and weight decay 0.0005. ... We used 600-dimensional embeddings, 600 hidden units, tied weights, dropout 0.65, and gradient clipping with threshold 2.0.
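
The Pseudocode row above lists Algorithms 1-3, but no reference implementation is released (see the Open Source Code row). Below is a minimal, illustrative sketch in Python of the kind of confidence-interval equivalence test Algorithm 3 performs on running statistics of the iterates, using the default δ = 0.02 and γ = 0.2 quoted in the Experiment Setup row. The statistic passed in, the scale term, and the function name sasa_should_drop are simplifications and assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from scipy import stats

def sasa_should_drop(z_samples, gamma=0.2, delta=0.02, conf_level=0.95):
    """Illustrative stand-in for SASA's stationarity test (Algorithm 3).

    z_samples: per-iteration statistics whose mean should be near zero
    once the iterates reach their stationary distribution (the exact
    statistic is defined in the paper; this sketch only shows the test
    mechanics). gamma keeps the most recent fraction of samples and
    delta sets the relative equivalence threshold.
    """
    z = np.asarray(z_samples, dtype=float)
    n = len(z)
    z = z[int((1.0 - gamma) * n):]          # keep only the last gamma-fraction
    if len(z) < 2:
        return False
    mean = z.mean()
    half_width = stats.t.ppf(0.5 + conf_level / 2.0, df=len(z) - 1) * stats.sem(z)
    lo, hi = mean - half_width, mean + half_width
    # Equivalence test: the confidence interval must lie inside a band of
    # half-width delta times a scale term (simplified here as the mean
    # absolute statistic -- an assumption, not necessarily the paper's choice).
    scale = np.abs(z).mean() + 1e-12
    return (-delta * scale < lo) and (hi < delta * scale)
```

In the general SASA loop (Algorithm 1), a positive test result triggers a learning-rate drop by a constant factor, after which statistics are collected afresh for the new phase; the excerpt quotes a factor-of-10 drop for the hand-tuned SGM baseline.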
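
The Dataset Splits row describes validation-tuned SGM and Adam baselines that drop the learning rate by a factor of 4 when the validation loss stops improving. A minimal sketch of such a baseline schedule in PyTorch (the framework implied by the Software Dependencies row) uses ReduceLROnPlateau with factor=0.25; the patience value and the dummy model and loop are assumptions added only to keep the snippet self-contained.

```python
import torch

# Dummy model and optimizer just to make the snippet self-contained;
# the real models are the ResNet-18 / word-level LSTM setups quoted above.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Baseline behaviour from the Dataset Splits row: drop the (global)
# learning rate by a factor of 4 when validation loss stops improving.
# The patience value is an assumption; the excerpt does not give one.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.25, patience=5)

for epoch in range(100):
    # ... one epoch of training would go here ...
    val_loss = 1.0 / (epoch + 1)     # stand-in for the measured validation loss
    scheduler.step(val_loss)         # scheduler reacts only to validation loss
```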
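
The Experiment Setup row pins down most of the CIFAR-10 baseline hyperparameters. Below is a sketch of that configuration in PyTorch, assuming the standard torchvision ResNet-18 and the usual CIFAR-10 crop padding of 4 (not stated in the excerpt); mapping α₀ and β directly onto torch.optim.SGD's lr and momentum is also an approximation, since the paper's SGM update may be parameterized differently.

```python
import torch
import torchvision
import torchvision.transforms as T

# Data augmentation quoted above: random cropping and random horizontal
# flipping. The crop padding of 4 is an assumption (common CIFAR-10
# practice); it is not stated in the excerpt.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=train_transform)

# Stand-in for the paper's 18-layer ResNet (the torchvision variant
# differs slightly from CIFAR-specific ResNet-18 implementations).
model = torchvision.models.resnet18(num_classes=10)

# SGM-hand baseline from the Experiment Setup row: α₀ = 1.0, β = 0.9,
# weight decay 0.0005, learning rate dropped by a factor of 10 every 50 epochs.
optimizer = torch.optim.SGD(
    model.parameters(), lr=1.0, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
```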