Using Statistics to Automate Stochastic Optimization
Authors: Hunter Lang, Lin Xiao, Pengchuan Zhang
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on several deep learning tasks demonstrate that this statistical adaptive stochastic approximation (SASA) method can automatically find good learning rate schedules and match the performance of hand-tuned methods using default settings of its parameters. |
| Researcher Affiliation | Industry | Hunter Lang, Pengchuan Zhang, Lin Xiao; Microsoft Research AI, Redmond, WA 98052, USA; {hunter.lang, penzhan, lin.xiao}@microsoft.com |
| Pseudocode | Yes | Algorithm 1: General SASA method; Algorithm 2: SASA; Algorithm 3: Test. (A sketch of the statistical test is given after the table.) |
| Open Source Code | No | The paper cites 'Pytorch word language model. https://github.com/pytorch/examples/tree/master/word_language_model, 2019.' which is a third-party example, but it does not provide access to the source code for the SASA methodology described in the paper. |
| Open Datasets | Yes | We trained an 18-layer ResNet model (He et al., 2016) on CIFAR-10 (Krizhevsky and Hinton, 2009); ImageNet (Deng et al., 2009); We train the PyTorch word-level language model example (2019) on the WikiText-2 dataset (Merity et al., 2016). |
| Dataset Splits | Yes | We compare against SGM and Adam with (global) learning rate tuned using a validation set. These baselines drop the learning rate by a factor of 4 when the validation loss stops improving. (A scheduler sketch of this baseline is given after the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions the 'PyTorch word-level language model example (2019)', which implies PyTorch, but no specific version number for PyTorch or other libraries is provided. |
| Experiment Setup | Yes | For all experiments, we use default values δ = 0.02 and γ = 0.2. In each experiment, we use the same α0 and β as for the best SGM baseline. We use weight decay in every experiment... SGM-hand uses α0 = 1.0 and β = 0.9 and drops the learning rate by a factor of 10 (ζ = 0.1) every 50 epochs. SASA uses γ = 0.2 and δ = 0.02, as always. Adam has a tuned global learning rate α0 = 0.0001 and a tuned warmup phase of 50 epochs... We trained an 18-layer ResNet model... with random cropping and random horizontal flipping for data augmentation and weight decay 0.0005. ... We used 600-dimensional embeddings, 600 hidden units, tied weights, dropout 0.65, and gradient clipping with threshold 2.0. (A sketch of the CIFAR-10 portion of this setup appears after the table.) |
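
To make the pseudocode row concrete, the following is a minimal sketch of the stationarity test that drives SASA, assuming the Yaida-style statistic z_k = <x_k, d_k> - (α/2)(1+β)/(1-β)·||d_k||^2 collected over recent iterates. The helper name `sasa_should_drop` and the normal-approximation critical value are our assumptions; the paper's Algorithm 3 may differ in detail.

```python
import math
from statistics import mean, stdev

def sasa_should_drop(zs, vs, gamma=0.2, delta=0.02, t_crit=1.96):
    """Hypothetical helper: decide whether the SGM dynamics look stationary.

    zs : per-iteration values of the statistic z_k, which should average to ~0
         at stationarity (Yaida's fluctuation-dissipation relation).
    vs : per-iteration values of the positive scale term
         (alpha/2)*(1+beta)/(1-beta)*||d_k||^2, used to make the test relative
         (threshold delta * mean(vs)).
    gamma : fraction of the most recent samples used for the test (default 0.2).
    delta : relative tolerance of the equivalence test (default 0.02).
    t_crit : critical value for the confidence interval (normal approximation).
    """
    n = int(gamma * len(zs))                  # keep only the last gamma fraction
    if n < 2:
        return False                          # not enough samples to test yet
    tail_z, tail_v = zs[-n:], vs[-n:]
    z_bar = mean(tail_z)
    half_width = t_crit * stdev(tail_z) / math.sqrt(n)
    # Drop the learning rate only if the whole confidence interval for E[z]
    # sits inside the relative band (-delta * v_bar, +delta * v_bar).
    return abs(z_bar) + half_width < delta * mean(tail_v)
```

In the full method, each training iteration appends z_k and the scale term to these buffers; when the test fires, the learning rate is multiplied by the drop factor (0.1 in the CIFAR-10 run) and the buffers are reset.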
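The validation-based baselines in the dataset-splits row (drop the learning rate by a factor of 4 when validation loss stops improving) map naturally onto PyTorch's `ReduceLROnPlateau` scheduler. A minimal sketch under that assumption follows; the `patience` value and the toy loss curve are ours, not values reported in the paper.

```python
import torch

# Stand-in parameters; the same scheduler wraps the real SGM or Adam baseline.
params = [torch.nn.Parameter(torch.zeros(10))]
optimizer = torch.optim.SGD(params, lr=1.0, momentum=0.9)

# factor=0.25 implements "drop the learning rate by a factor of 4";
# patience=10 is an assumed value, not reported in the table.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.25, patience=10)

for epoch in range(60):
    val_loss = max(0.1, 1.0 / (epoch + 1))   # toy loss that plateaus after epoch 9
    scheduler.step(val_loss)                 # drops lr once val_loss stops improving
    print(epoch, optimizer.param_groups[0]["lr"])
```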
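For the CIFAR-10 experiment in the setup row, a minimal PyTorch sketch of the reported configuration (ResNet-18, random crop and flip augmentation, α0 = 1.0, β = 0.9, weight decay 0.0005) might look as follows. The crop padding, normalization statistics, and batch size are assumed CIFAR-10 conventions, and torchvision's `resnet18` is an ImageNet-style stand-in for the paper's 18-layer model.

```python
import torch
import torchvision
from torchvision import transforms

# Augmentation reported in the setup: random cropping and random horizontal flipping.
# padding=4 and the normalization statistics are assumptions, not stated values.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=train_transform)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=128, shuffle=True)   # batch size assumed

# 18-layer ResNet with 10 output classes (ImageNet-style stand-in).
model = torchvision.models.resnet18(num_classes=10)

# Reported SGM hyperparameters: alpha_0 = 1.0, beta = 0.9, weight decay 0.0005.
# SASA's defaults gamma = 0.2, delta = 0.02 and drop factor 0.1 sit on top of this.
optimizer = torch.optim.SGD(
    model.parameters(), lr=1.0, momentum=0.9, weight_decay=5e-4)
```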