Lookahead Optimizer: k steps forward, 1 step back
Authors: Michael Zhang, James Lucas, Jimmy Ba, Geoffrey E. Hinton
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluate Lookahead by training classifiers on the CIFAR [19] and ImageNet datasets [5], observing faster convergence on the ResNet-50 and ResNet-152 architectures [11]. We also trained LSTM language models on the Penn Treebank dataset [24] and Transformer-based [42] neural machine translation models on the WMT 2014 English-to-German dataset. For all tasks, using Lookahead leads to improved convergence over the inner optimizer and often improved generalization performance while being robust to hyperparameter changes. |
| Researcher Affiliation | Academia | Michael R. Zhang, James Lucas, Geoffrey Hinton, Jimmy Ba; Department of Computer Science, University of Toronto, Vector Institute; {michael, jlucas, hinton, jba}@cs.toronto.edu |
| Pseudocode | Yes | Figure 1: (Right) Pseudocode for Lookahead. Algorithm 1 Lookahead Optimizer. (A minimal sketch of the update rule appears below the table.) |
| Open Source Code | Yes | Our open source implementation is available at https://github.com/michaelrzhang/lookahead. |
| Open Datasets | Yes | Empirically, we evaluate Lookahead by training classifiers on the CIFAR [19] and ImageNet datasets [5]... We also trained LSTM language models on the Penn Treebank dataset [24] and Transformer-based [42] neural machine translation models on the WMT 2014 English-to-German dataset. |
| Dataset Splits | Yes | The CIFAR-10 and CIFAR-100 datasets for classification consist of 32×32 color images, with 10 and 100 different classes, split into a training set with 50,000 images and a test set with 10,000 images... The 1000-way ImageNet task [5] is a classification task that contains roughly 1.28 million training images and 50,000 validation images. (A data-loading sketch of the CIFAR split appears below the table.) |
| Hardware Specification | Yes | We trained Transformer-based models [42] on the WMT 2014 English-to-German translation task on a single Tensor Processing Unit (TPU) node. |
| Software Dependencies | No | The paper mentions using PyTorch for the ImageNet implementation, but does not provide specific version numbers for PyTorch or other software dependencies, which would be necessary for reproduction. |
| Experiment Setup | Yes | We ran all our CIFAR experiments with 3 seeds and trained for 200 epochs on a ResNet-18 [11] with batches of 128 images and decay the learning rate by a factor of 5 at the 60th, 120th, and 160th epochs. ... Our baseline algorithm is SGD with an initial learning rate of 0.1 and momentum value of 0.9. We train for 90 epochs and decay our learning rate by a factor of 10 at the 30th and 60th epochs. For Lookahead, we set k = 5 and slow weights step size α = 0.5. (An illustrative reconstruction of this schedule appears below the table.) |
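The "Pseudocode" row above points at Algorithm 1 of the paper. As a rough illustration of the update rule that algorithm describes, here is a minimal PyTorch-style sketch; it is not the authors' released implementation (see the GitHub link above), and `loss_fn`, `data_iter`, and the default hyperparameter values are placeholders. The slow weights are kept as a detached copy, the inner optimizer takes k fast-weight steps, and the slow weights then move toward the fast weights by a factor α.

```python
import torch

def train_with_lookahead(model, inner_optimizer, loss_fn, data_iter,
                         num_outer_steps, k=5, alpha=0.5):
    """Sketch of the Lookahead update rule (Algorithm 1 in the paper).

    `inner_optimizer` is any torch.optim optimizer over model.parameters()
    (e.g. SGD or Adam); `loss_fn` and `data_iter` are hypothetical placeholders.
    """
    # Slow weights: a detached copy of the model's (fast) parameters.
    slow_weights = [p.detach().clone() for p in model.parameters()]

    for _ in range(num_outer_steps):
        # k fast-weight updates with the inner optimizer.
        for _ in range(k):
            x, y = next(data_iter)
            inner_optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            inner_optimizer.step()

        # Slow-weight update: phi <- phi + alpha * (theta - phi),
        # then reset the fast weights to the new slow weights.
        with torch.no_grad():
            for slow, fast in zip(slow_weights, model.parameters()):
                slow += alpha * (fast - slow)
                fast.copy_(slow)
    return model
```

The paper evaluates Lookahead wrapped around standard inner optimizers such as SGD and Adam; the linked repository contains the authors' official implementation.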
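To make the CIFAR split in the "Dataset Splits" row concrete, the 50,000-image training set and 10,000-image test set can be loaded with torchvision as below. The normalization constants are commonly used CIFAR-10 statistics and are an assumption, not values quoted from the paper.

```python
from torchvision import datasets, transforms

# Standard CIFAR-10 per-channel statistics (an assumption; not quoted from the paper).
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

# The canonical split: 50,000 training images and 10,000 test images of size 32x32.
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
print(len(train_set), len(test_set))  # 50000 10000
```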
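The ImageNet baseline quoted in the "Experiment Setup" row (SGD with learning rate 0.1 and momentum 0.9, 90 epochs, learning rate decayed by a factor of 10 at the 30th and 60th epochs) corresponds to a standard PyTorch schedule. The sketch below is an illustrative reconstruction, not the authors' training script; the Lookahead wrapping with k = 5 and α = 0.5 is indicated only as a comment.

```python
import torch
from torchvision.models import resnet50

model = resnet50(num_classes=1000)
inner_optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Decay the learning rate by 10x at the 30th and 60th epochs, as quoted above.
scheduler = torch.optim.lr_scheduler.MultiStepLR(inner_optimizer, milestones=[30, 60], gamma=0.1)

for epoch in range(90):
    # ... one training epoch with inner_optimizer goes here; in the paper's Lookahead
    # runs the inner optimizer is wrapped with k = 5 and slow-weight step size 0.5 ...
    scheduler.step()
```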