The Road Less Scheduled

Authors: Aaron Defazio, Xingyu Yang, Ahmed Khaled, Konstantin Mishchenko, Harsh Mehta, Ashok Cutkosky

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform, to our knowledge, one of the largest machine learning optimization algorithm evaluations to date, consisting of 28 problems, ranging from logistic regression to large-scale deep learning problems.
Researcher Affiliation | Collaboration | Aaron Defazio (Fundamental AI Research Team, Meta); Xingyu (Alice) Yang (Fundamental AI Research Team, Meta); Harsh Mehta (Google Research); Konstantin Mishchenko (Samsung AI Center); Ahmed Khaled (Princeton University); Ashok Cutkosky (Boston University)
Pseudocode | Yes | Algorithm 1: Schedule-Free AdamW (a hedged sketch of the underlying update appears after this table).
Open Source Code | Yes | An open source implementation of our method is available at https://github.com/facebookresearch/schedule_free (a usage sketch appears after this table).
Open Datasets | Yes | For our deep learning experiments, we evaluated Schedule-Free learning on a set of benchmark tasks that are commonly used in the optimization research literature:
- CIFAR-10: a Wide ResNet (16-8) architecture (Zagoruyko and Komodakis, 2016) on the CIFAR-10 image classification dataset.
- CIFAR-100: a DenseNet (Huang et al., 2017) architecture on the CIFAR-100 (100-class) classification dataset.
- SVHN: a deep ResNet architecture (3-96) on the Street View House Numbers (SVHN) dataset.
- ImageNet: a standard ResNet-50 architecture (He et al., 2016) on the ILSVRC 2012 ImageNet (Russakovsky et al., 2015) classification dataset.
- IWSLT14: an LSTM architecture (Wiseman and Rush, 2016) on the IWSLT14 German-English translation dataset (Cettolo et al., 2014).
- DLRM: the DLRM (Naumov et al., 2019) architecture on the Criteo Kaggle Display Advertising dataset (Jean-Baptiste Tien, 2014).
- MRI: a stacked U-Net architecture (Sriram et al., 2020) on the fastMRI dataset (Zbontar et al., 2018).
- MAE: fine-tuning a pretrained Masked Autoencoder (He et al., 2021) ViT (patch16-512d-8b) on the ILSVRC 2012 ImageNet dataset.
- NanoGPT: a 124M-parameter GPT-2 (Radford et al., 2019) style decoder-only transformer on the OpenWebText dataset (Gokaslan and Cohen, 2019).
Dataset Splits | No | The paper uses well-known public datasets but does not explicitly state the training/validation/test splits for its own experiments, or explain how the validation data was used or split from the main dataset.
Hardware Specification | Yes | GPUs: 1× V100
Software Dependencies | Yes | Note that we found that training failed using PyTorch 2 or newer, and so we ran these experiments using PyTorch 1.9.
Experiment Setup | Yes | Hyper-parameter values: GPUs: 1× V100; Batch size: 16; Epochs: 100; Seeds: 10; Schedule-Free β1: 0.9 (an illustrative configuration sketch appears after this table).
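
The Pseudocode row above refers to Algorithm 1 (Schedule-Free AdamW) in the paper. Below is a minimal NumPy sketch of the simpler Schedule-Free SGD recurrence described in the same paper, for illustration only: `grad_fn`, `lr`, `beta`, and `steps` are placeholder names, and the full AdamW variant additionally uses Adam's second-moment normalization, bias correction, warmup-dependent averaging weights, and decoupled weight decay.

```python
import numpy as np

def schedule_free_sgd(grad_fn, x0, lr=1.0, beta=0.9, steps=1000):
    """Minimal sketch of the Schedule-Free SGD recurrence.

    grad_fn(y) returns a (stochastic) gradient at y. Three sequences are
    maintained: z (the base SGD iterate), x (a running average of the z's,
    returned as the solution) and y (the point where gradients are taken).
    """
    z = np.array(x0, dtype=float)   # base iterate, updated by plain SGD steps
    x = z.copy()                    # equal-weighted average of z_1, ..., z_t
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x   # gradient evaluation point y_t
        g = grad_fn(y)                     # stochastic gradient at y_t
        z = z - lr * g                     # z_{t+1} = z_t - lr * g
        c = 1.0 / (t + 1)                  # averaging weight c_{t+1} = 1/(t+1)
        x = (1.0 - c) * x + c * z          # x_{t+1}: online mean of the z's
    return x
```

On a toy quadratic, for example, `schedule_free_sgd(lambda y: 2 * y, np.ones(3), lr=0.1, steps=200)` moves toward the origin without any learning-rate schedule, which is the point of the method.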
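For the Open Source Code row, the linked repository ships a PyTorch package. The snippet below is a hedged usage sketch based on that repository's documented interface (the `schedulefree.AdamWScheduleFree` optimizer and its `train()`/`eval()` mode switches); the model, data, and learning rate here are placeholders, not the paper's exact setup.

```python
import torch
import schedulefree  # pip install schedulefree

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)

# Schedule-Free optimizers interpolate between averaged and base iterates,
# so the optimizer itself must be switched between train and eval modes.
optimizer.train()
for x, y in [(torch.randn(16, 10), torch.randint(0, 2, (16,)))]:  # toy batch
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()

optimizer.eval()  # switch to the averaged weights before evaluation
```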
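The Experiment Setup row reports only a handful of hyper-parameters (1 V100 GPU, batch size 16, 100 epochs, 10 seeds, Schedule-Free β1 = 0.9), and the Software Dependencies row notes that these runs required PyTorch 1.9 rather than PyTorch 2. One illustrative way to record that environment and those settings is sketched below; anything not quoted in the table (the learning rate, the exact seed values, the package pin syntax) is an assumption.

```python
# Environment pin implied by the Software Dependencies row (assumption on syntax):
#   torch==1.9.*
#   schedulefree

# Hyper-parameters reported in the Experiment Setup row.
EXPERIMENT = {
    "gpus": "1x V100",
    "batch_size": 16,
    "epochs": 100,
    "seeds": list(range(10)),    # 10 seeds; the exact seed values are an assumption
    "schedule_free_beta1": 0.9,
    "lr": None,                  # placeholder: not specified in this row
}
```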