Towards Understanding Why Lookahead Generalizes Better Than SGD and Beyond

Authors: Pan Zhou, Hanshu Yan, Xiaotong Yuan, Jiashi Feng, Shuicheng Yan

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on CIFAR10/100 and ImageNet testify its advantages.
Researcher Affiliation | Collaboration | Sea AI Lab, Singapore; Nanjing University of Information Science & Technology, Nanjing, China; {zhoupan, yanhanshu, fengjs, yansc}@sea.com; xtyuan@nuist.edu.cn
Pseudocode | Yes | Algorithm 1: Lookahead Optimization Procedure (F_S(θ), η, T, α, k, θ_0, A, S) and Algorithm 2: Stagewise Locally-Regularized Lookahead (SLRLA). (A sketch of the Lookahead update appears below the table.)
Open Source Code | Yes | Codes is available at https://github.com/sail-sg/SLRLA-optimizer.
Open Datasets | Yes | Experimental results on CIFAR10/100 and ImageNet testify its advantages. Codes is available at https://github.com/sail-sg/SLRLA-optimizer. ... Here we investigate the effects of α on the performance of lookahead, stagewise lookahead [1] (SLA) and SLRLA on a regularized softmax problem with MNIST [56]. ... We evaluate SLA and SLRLA on CIFAR10/100 [58] and ImageNet [59] using different network architectures...
Dataset Splits | Yes | Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Please see the experimental settings in Sec. 6 and Appendix B.
Hardware Specification | Yes | We use two A100 GPUs on ImageNet, and use single A100 GPU for all remaining experiments.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies.
Experiment Setup | Yes | Following our theory, we use a linearly decayed learning rate (LR) for lookahead, and multi-step decayed LRs for SLA/SLRLA. See more details in Appendix B. ... For all experiments, SLRLA and SLA set k=5, a momentum of 0.9, and a multi-stage learning rate (LR) decay at the {0.3S, 0.6S, 0.8S}-th epoch with total epoch number S. On CIFAR10/100, we train 200 epochs with α=0.8, a weight decay of 10^-3, and set LR decay rate as 0.2. On ImageNet, we run 100 epochs using α=0.5, a weight decay of 10^-4 and an LR decay rate of 0.1. ... For regularization constant β_q, SLRLA selects it from {0.02, 0.2, 2.0, 20} via cross validation, and finally sets it as 0.2 on CIFAR10/100 and 20 on ImageNet. (A sketch of this multi-stage LR schedule appears below.)
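
The Pseudocode row above refers to Algorithm 1, the standard Lookahead procedure. The following is a minimal sketch of that update rule in plain Python, assuming vanilla SGD as the inner optimizer A; the function and argument names (lookahead_sgd, grad_fn, outer_steps) are illustrative and not taken from the paper's code.

```python
import numpy as np

def lookahead_sgd(grad_fn, theta0, eta=0.1, alpha=0.5, k=5, outer_steps=100):
    """Illustrative sketch of the Lookahead update rule (not the authors'
    implementation): run k inner SGD steps on the fast weights, then move
    the slow weights a fraction alpha toward the result."""
    slow = np.asarray(theta0, dtype=float).copy()
    for _ in range(outer_steps):
        fast = slow.copy()
        for _ in range(k):                   # k inner steps of the base optimizer (plain SGD here)
            fast = fast - eta * grad_fn(fast)
        slow = slow + alpha * (fast - slow)  # slow-weight interpolation with step alpha
    return slow

# Toy usage on f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta_star = lookahead_sgd(lambda th: th, theta0=[1.0, -2.0])
```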
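
The Experiment Setup row describes a multi-stage LR decay applied at the {0.3S, 0.6S, 0.8S}-th epochs. Below is a small Python sketch of that schedule under the CIFAR10/100 setting reported in the table (S = 200, decay rate 0.2); the helper name multistep_lr and the base LR of 0.1 are assumptions, since the base LR is not stated in this excerpt.

```python
def multistep_lr(base_lr, epoch, total_epochs, decay_rate):
    """Sketch of the multi-stage LR decay described above: multiply the base
    LR by decay_rate once at each of the 0.3S, 0.6S and 0.8S epochs."""
    milestones = (0.3 * total_epochs, 0.6 * total_epochs, 0.8 * total_epochs)
    num_decays = sum(epoch >= m for m in milestones)
    return base_lr * decay_rate ** num_decays

# CIFAR10/100 setting from the table: S = 200 epochs, LR decay rate 0.2;
# the base LR of 0.1 is an assumed placeholder, not given in this excerpt.
schedule = [multistep_lr(0.1, e, total_epochs=200, decay_rate=0.2) for e in range(200)]
# The LR drops at epochs 60, 120 and 160, i.e. 0.1 -> 0.02 -> 0.004 -> 0.0008.
```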