Towards Understanding Why Lookahead Generalizes Better Than SGD and Beyond
Authors: Pan Zhou, Hanshu Yan, Xiaotong Yuan, Jiashi Feng, Shuicheng Yan
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on CIFAR10/100 and ImageNet testify its advantages. |
| Researcher Affiliation | Collaboration | Sea AI Lab, Singapore; Nanjing University of Information Science & Technology, Nanjing, China; {zhoupan, yanhanshu, fengjs, yansc}@sea.com; xtyuan@nuist.edu.cn |
| Pseudocode | Yes | Algorithm 1: Lookahead Optimization Procedure (F_S(θ), η, T, α, k, θ_0, A, S) and Algorithm 2: Stagewise Locally-Regularized Lookahead (SLRLA); a minimal code sketch of the Lookahead step is given below the table. |
| Open Source Code | Yes | Code is available at https://github.com/sail-sg/SLRLA-optimizer. |
| Open Datasets | Yes | Experimental results on CIFAR10/100 and ImageNet testify its advantages. Code is available at https://github.com/sail-sg/SLRLA-optimizer. ... Here we investigate the effects of α on the performance of lookahead, stagewise lookahead [1] (SLA) and SLRLA on a regularized softmax problem with MNIST [56]. ... We evaluate SLA and SLRLA on CIFAR10/100 [58] and ImageNet [59] using different network architectures... |
| Dataset Splits | Yes | Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Please see the experimental settings in Sec. 6 and Appendix B. |
| Hardware Specification | Yes | We use two A100 GPUs on ImageNet, and use a single A100 GPU for all remaining experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | Following our theory, we use a linearly decayed learning rate (LR) for lookahead, and multi-step decayed LRs for SLA/SLRLA. See more details in Appendix B. ... For all experiments, SLRLA and SLA set k=5, a momentum of 0.9, and a multi-stage learning rate (LR) decay at the {0.3S, 0.6S, 0.8S}-th epoch with total epoch number S. On CIFAR10/100, we train 200 epochs with α=0.8, a weight decay of 10⁻³, and set the LR decay rate as 0.2. On ImageNet, we run 100 epochs using α=0.5, a weight decay of 10⁻⁴ and an LR decay rate of 0.1. ... For the regularization constant β_q, SLRLA selects it from {0.02, 0.2, 2.0, 20} via cross validation, and finally sets it as 0.2 on CIFAR10/100 and 20 on ImageNet. A hedged configuration sketch of this LR schedule follows the table. |
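
The Pseudocode row references the Lookahead procedure (Algorithm 1). As a rough illustration of the two-loop structure it describes, below is a minimal PyTorch-style sketch: k fast steps of an inner optimizer A (plain SGD here), followed by the slow-weight interpolation slow ← slow + α(fast − slow). The function name, the choice of SGD as the inner optimizer, the base learning rate, and the training-loop scaffolding are illustrative assumptions; the stagewise local regularization controlled by β_q in SLRLA (Algorithm 2) is not shown.

```python
import torch

def lookahead_train(model, loss_fn, data_loader, lr=0.1, alpha=0.8, k=5, momentum=0.9):
    """Minimal sketch of a Lookahead loop: k fast inner-optimizer updates,
    then slow weights move a fraction alpha toward the fast weights and the
    fast weights are reset to the slow weights."""
    inner = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)  # inner optimizer A
    slow = [p.detach().clone() for p in model.parameters()]               # slow weights

    for step, (x, y) in enumerate(data_loader, start=1):
        inner.zero_grad()
        loss_fn(model(x), y).backward()
        inner.step()                           # fast-weight update

        if step % k == 0:                      # every k fast steps
            with torch.no_grad():
                for s, p in zip(slow, model.parameters()):
                    s.add_(alpha * (p - s))    # slow <- slow + alpha * (fast - slow)
                    p.copy_(s)                 # reset fast weights to slow weights
```

With the CIFAR10/100 values quoted above, this sketch would be invoked with alpha=0.8 and k=5; the base learning rate of 0.1 is a placeholder, not a value from the paper.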
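
The Experiment Setup row quotes a multi-stage LR decay at the {0.3S, 0.6S, 0.8S}-th epoch with total epoch number S. One possible way to express that schedule, assuming PyTorch's MultiStepLR, is sketched below; the SGD wrapper and its base learning rate are assumptions not stated in the quoted text.

```python
import torch

def multistage_lr(optimizer, total_epochs, decay_rate):
    """LR is multiplied by decay_rate at the 0.3S-, 0.6S-, and 0.8S-th epoch of S epochs."""
    milestones = [int(r * total_epochs) for r in (0.3, 0.6, 0.8)]
    return torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=decay_rate)

# CIFAR10/100 values quoted above: S=200 epochs, decay rate 0.2, weight decay 1e-3,
# momentum 0.9 (alpha=0.8 and k=5 belong to the Lookahead wrapper, not to SGD itself).
# The base LR of 0.1 below is a placeholder, not taken from the quoted setup.
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-3)
# scheduler = multistage_lr(optimizer, total_epochs=200, decay_rate=0.2)
# For ImageNet the quoted values are S=100, decay rate 0.1, weight decay 1e-4, alpha=0.5.
```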