Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Towards Understanding Why Lookahead Generalizes Better Than SGD and Beyond
Authors: Pan Zhou, Hanshu Yan, Xiaotong Yuan, Jiashi Feng, Shuicheng Yan
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on CIFAR10/100 and ImageNet testify its advantages. |
| Researcher Affiliation | Collaboration | Sea AI Lab, Singapore; Nanjing University of Information Science & Technology, Nanjing, China |
| Pseudocode | Yes | Algorithm 1: Lookahead Optimization Procedure (F_S(θ), η, T, α, k, θ_0, A, S) and Algorithm 2: Stagewise Locally-Regularized LookAhead (SLRLA) |
| Open Source Code | Yes | Codes is available at https://github.com/sail-sg/SLRLA-optimizer. |
| Open Datasets | Yes | Experimental results on CIFAR10/100 and ImageNet testify its advantages. Codes is available at https://github.com/sail-sg/SLRLA-optimizer. ... Here we investigate the effects of α on the performance of lookahead, stagewise lookahead [1] (SLA) and SLRLA on a regularized softmax problem with MNIST [56]. ... We evaluate SLA and SLRLA on CIFAR10/100 [58] and ImageNet [59] using different network architectures... |
| Dataset Splits | Yes | Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Please see the experimental settings in Sec. 6 and Appendix B. |
| Hardware Specification | Yes | We use two A100 GPUs on ImageNet, and use single A100 GPU for all remaining experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | Following our theory, we use a linearly decayed learning rate (LR) for lookahead, and multi-step decayed LRs for SLA/SLRLA. See more details in Appendix B. ... For all experiments, SLRLA and SLA set k=5, a momentum of 0.9, and a multi-stage learning rate (LR) decay at the {0.3S, 0.6S, 0.8S}-th epoch with total epoch number S. On CIFAR10/100, we train 200 epochs with α=0.8, a weight decay of 10⁻³, and set LR decay rate as 0.2. On ImageNet, we run 100 epochs using α=0.5, a weight decay of 10⁻⁴ and an LR decay rate of 0.1. ... For regularization constant β_q, SLRLA selects it from {0.02, 0.2, 2.0, 20} via cross validation, and finally sets it as 0.2 on CIFAR10/100 and 20 on ImageNet. |
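
The experiment setup row can be made concrete with a minimal Python sketch of the multi-step LR decay at the {0.3S, 0.6S, 0.8S}-th epochs and of one outer Lookahead round with k=5. It is an illustration under stated assumptions, not the authors' implementation: the function names `multistep_lr` and `lookahead_round` are hypothetical, the base learning rate of 0.1 is a placeholder not given in the table, the inner step is plain SGD without the reported momentum of 0.9, α is assumed to weight the move toward the fast weights (the common Lookahead convention), and the stagewise local regularization of SLRLA is not modeled.

```python
def multistep_lr(epoch, total_epochs, base_lr, decay_rate,
                 milestones=(0.3, 0.6, 0.8)):
    """Learning rate for a 0-indexed epoch under multi-step decay."""
    # Count how many milestone epochs (fractions of the total) have been reached.
    passed = sum(1 for m in milestones if epoch >= round(m * total_epochs))
    return base_lr * decay_rate ** passed


def lookahead_round(slow, grad_fn, lr, k=5, alpha=0.8):
    """One outer Lookahead round: k fast steps, then interpolate the slow weights."""
    fast = list(slow)  # fast weights start from the current slow weights
    for _ in range(k):
        grads = grad_fn(fast)
        # Plain SGD inner step; the momentum of 0.9 in the reported setup is omitted.
        fast = [w - lr * g for w, g in zip(fast, grads)]
    # Assumed convention: alpha weights the move toward the fast weights.
    return [s + alpha * (f - s) for s, f in zip(slow, fast)]


if __name__ == "__main__":
    # CIFAR-style settings from the table; base_lr=0.1 is a hypothetical value.
    for epoch in (0, 60, 120, 160):
        print(epoch, multistep_lr(epoch, 200, base_lr=0.1, decay_rate=0.2))
    # Toy objective f(w) = 0.5 * w^2, whose gradient is w itself.
    print(lookahead_round([1.0], lambda w: list(w), lr=0.1))
```

With S=200 the learning rate drops at epochs 60, 120, and 160; with S=100 (the ImageNet setting) it drops at epochs 30, 60, and 80. For actual training, the released repository at https://github.com/sail-sg/SLRLA-optimizer is the authoritative reference.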