Lookahead Optimizer: k steps forward, 1 step back
Authors: Michael Zhang, James Lucas, Jimmy Ba, Geoffrey E. Hinton
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluate Lookahead by training classifiers on the CIFAR [19] and ImageNet datasets [5], observing faster convergence on the ResNet-50 and ResNet-152 architectures [11]. We also trained LSTM language models on the Penn Treebank dataset [24] and Transformer-based [42] neural machine translation models on the WMT 2014 English-to-German dataset. For all tasks, using Lookahead leads to improved convergence over the inner optimizer and often improved generalization performance while being robust to hyperparameter changes. |
| Researcher Affiliation | Academia | Michael R. Zhang, James Lucas, Geoffrey Hinton, Jimmy Ba; Department of Computer Science, University of Toronto, Vector Institute; {michael, jlucas, hinton, jba}@cs.toronto.edu |
| Pseudocode | Yes | Figure 1: (Right) Pseudocode for Lookahead. Algorithm 1 Lookahead Optimizer. (A minimal sketch of the update rule appears below the table.) |
| Open Source Code | Yes | Our open source implementation is available at https://github.com/michaelrzhang/lookahead. |
| Open Datasets | Yes | Empirically, we evaluate Lookahead by training classifiers on the CIFAR [19] and ImageNet datasets [5]... We also trained LSTM language models on the Penn Treebank dataset [24] and Transformer-based [42] neural machine translation models on the WMT 2014 English-to-German dataset. |
| Dataset Splits | Yes | The CIFAR-10 and CIFAR-100 datasets for classification consist of 32×32 color images, with 10 and 100 different classes, split into a training set with 50,000 images and a test set with 10,000 images... The 1000-way ImageNet task [5] is a classification task that contains roughly 1.28 million training images and 50,000 validation images. (A data-loading sketch of the CIFAR split appears below the table.) |
| Hardware Specification | Yes | We trained Transformer-based models [42] on the WMT 2014 English-to-German translation task on a single Tensor Processing Unit (TPU) node. |
| Software Dependencies | No | The paper mentions using PyTorch for the ImageNet implementation, but does not provide specific version numbers for PyTorch or other software dependencies, which would be necessary for reproduction. |
| Experiment Setup | Yes | We ran all our CIFAR experiments with 3 seeds and trained for 200 epochs on a ResNet-18 [11] with batches of 128 images and decay the learning rate by a factor of 5 at the 60th, 120th, and 160th epochs. ... Our baseline algorithm is SGD with an initial learning rate of 0.1 and momentum value of 0.9. We train for 90 epochs and decay our learning rate by a factor of 10 at the 30th and 60th epochs. For Lookahead, we set k = 5 and slow weights step size α = 0.5. (An illustrative reconstruction of this schedule appears below the table.) |
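The "Pseudocode" row above points at Algorithm 1 of the paper. As a rough illustration of the update rule that algorithm describes, here is a minimal PyTorch-style sketch; it is not the authors' released implementation (see the GitHub link above), and `loss_fn`, `data_iter`, and the default hyperparameter values are placeholders. The slow weights are kept as a detached copy, the inner optimizer takes k fast-weight steps, and the slow weights then move toward the fast weights by a factor α.

```python
import torch

def train_with_lookahead(model, inner_optimizer, loss_fn, data_iter,
                         num_outer_steps, k=5, alpha=0.5):
    """Sketch of the Lookahead update rule (Algorithm 1 in the paper).

    `inner_optimizer` is any torch.optim optimizer over model.parameters()
    (e.g. SGD or Adam); `loss_fn` and `data_iter` are hypothetical placeholders.
    """
    # Slow weights: a detached copy of the model's (fast) parameters.
    slow_weights = [p.detach().clone() for p in model.parameters()]

    for _ in range(num_outer_steps):
        # k fast-weight updates with the inner optimizer.
        for _ in range(k):
            x, y = next(data_iter)
            inner_optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            inner_optimizer.step()

        # Slow-weight update: phi <- phi + alpha * (theta - phi),
        # then reset the fast weights to the new slow weights.
        with torch.no_grad():
            for slow, fast in zip(slow_weights, model.parameters()):
                slow += alpha * (fast - slow)
                fast.copy_(slow)
    return model
```

The paper evaluates Lookahead wrapped around standard inner optimizers such as SGD and Adam; the linked repository contains the authors' official implementation.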
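To make the CIFAR split in the "Dataset Splits" row concrete, the 50,000-image training set and 10,000-image test set can be loaded with torchvision as below. The normalization constants are commonly used CIFAR-10 statistics and are an assumption, not values quoted from the paper.

```python
from torchvision import datasets, transforms

# Standard CIFAR-10 per-channel statistics (an assumption; not quoted from the paper).
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

# The canonical split: 50,000 training images and 10,000 test images of size 32x32.
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
print(len(train_set), len(test_set))  # 50000 10000
```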
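The ImageNet baseline quoted in the "Experiment Setup" row (SGD with learning rate 0.1 and momentum 0.9, 90 epochs, learning rate decayed by a factor of 10 at the 30th and 60th epochs) corresponds to a standard PyTorch schedule. The sketch below is an illustrative reconstruction, not the authors' training script; the Lookahead wrapping with k = 5 and α = 0.5 is indicated only as a comment.

```python
import torch
from torchvision.models import resnet50

model = resnet50(num_classes=1000)
inner_optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Decay the learning rate by 10x at the 30th and 60th epochs, as quoted above.
scheduler = torch.optim.lr_scheduler.MultiStepLR(inner_optimizer, milestones=[30, 60], gamma=0.1)

for epoch in range(90):
    # ... one training epoch with inner_optimizer goes here; in the paper's Lookahead
    # runs the inner optimizer is wrapped with k = 5 and slow-weight step size 0.5 ...
    scheduler.step()
```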