Learning-Rate-Free Learning by D-Adaptation

Authors: Aaron Defazio, Konstantin Mishchenko

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems.
Researcher Affiliation | Collaboration | 1Meta AI, Fundamental AI Research (FAIR) team, New York; 2CNRS, ENS, INRIA SIERRA team, France.
Pseudocode | Yes | Algorithm 1 Dual Averaging with D-Adaptation; Algorithm 2 Gradient Descent with D-Adaptation; Algorithm 3 D-Adapted AdaGrad; Algorithm 4 SGD with D-Adaptation; Algorithm 5 Adam with D-Adaptation. (A minimal sketch of the shared mechanism appears after the table.)
Open Source Code | Yes | An open-source implementation is available at https://github.com/facebookresearch/dadaptation. (A usage sketch follows the table.)
Open Datasets | Yes | For our convex experiments, we considered logistic regression applied to 12 commonly used benchmark problems from the LIBSVM repository. ... We used the three most common datasets used for optimization method testing: CIFAR10, CIFAR100 (Krizhevsky, 2009) and ImageNet 2012 (Russakovsky et al., 2015). ... The IWSLT14 German-to-English dataset (Cettolo et al., 2014) is a standard choice for benchmarking machine translation models. ... We train on the Book-Wiki corpus (combining books from Zhu et al. (2015) and a snapshot of Wikipedia). ... The COCO 2017 object detection task is a popular benchmark in computer vision. ... The fastMRI Knee Dataset (Zbontar et al., 2018) is a large-scale release of raw MRI data. ... The Criteo Kaggle Display Advertising dataset is a large, sparse dataset of user clickthrough events.
Dataset Splits | Yes | The learning rate for Adam was chosen as the value that gave the highest final accuracy using a grid search. ... Unless otherwise mentioned, we used the standard learning rate schedule typically used for the problem, with the base learning rate set by D-Adaptation.
Hardware Specification | Yes | Table 3. CIFAR10 experiment: GPUs 1 V100. ... Table 5. ImageNet experiment: GPUs 8 V100.
Software Dependencies | No | The paper mentions software such as the PyTorch Image Models framework, Detectron2, and fairseq, but does not provide specific version numbers for these dependencies (e.g., 'PyTorch 1.9').
Experiment Setup | Yes | Full hyper-parameter settings for each problem are included in the Appendix. ... Table 3. CIFAR10 experiment hyper-parameters: Epochs 300, Batch size per GPU 128, LR schedule 150-225 tenthing, Momentum 0.9, SGD LR 0.1. (The schedule is sketched after the table.)
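
All five pseudocode listings share one mechanism: maintain a non-decreasing lower bound d_k on the unknown distance D = ||x_0 - x*|| from the initial point to a solution, and scale the step size by that bound. The sketch below illustrates the estimate for plain gradient descent. It is a simplified illustration rather than a verbatim transcription of the paper's Algorithm 2; the 1/(G*sqrt(k+1)) step scaling, the d_adapted_gd name, and the quadratic demo objective are our assumptions.

import numpy as np

def d_adapted_gd(grad, x0, n_steps, d0=1e-6, G=1.0):
    # Gradient descent whose step size is scaled by a running lower
    # bound d on D = ||x0 - x*||. d0 is a small initial guess; G is an
    # assumed bound on gradient norms (used only to scale the steps).
    x, d = x0.copy(), d0
    s = np.zeros_like(x0)   # weighted gradient sum: sum_k lambda_k g_k
    numer = 0.0             # running sum_k lambda_k <g_k, x0 - x_k>
    for k in range(n_steps):
        g = grad(x)
        lam = d / (G * np.sqrt(k + 1))   # step grows with the estimate d
        numer += lam * g.dot(x0 - x)
        s += lam * g
        x = x - lam * g
        # Cauchy-Schwarz gives D * ||s|| >= sum_k lambda_k <g_k, x0 - x_k>,
        # so numer / ||s|| is a valid lower bound on D.
        d_hat = numer / (np.linalg.norm(s) + 1e-12)
        d = max(d, d_hat)   # the estimate only ever increases
    return x, d

# Demo on f(x) = 0.5 * ||x - 3||^2, where D = ||x0 - x*|| = 3 * sqrt(5).
x_final, d_est = d_adapted_gd(lambda x: x - 3.0, np.zeros(5), n_steps=2000, G=3.0)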
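
The open-source package is designed as a drop-in replacement for PyTorch optimizers. A minimal usage sketch, assuming the DAdaptAdam interface documented in the repository README (the toy linear model and random data are placeholders):

# pip install dadaptation
import torch
from dadaptation import DAdaptAdam

model = torch.nn.Linear(10, 2)
# lr is left at 1.0: D-Adaptation chooses the effective step size itself,
# and any external schedule multiplies on top of the adapted base rate.
optimizer = DAdaptAdam(model.parameters(), lr=1.0)

for _ in range(10):
    x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()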
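
The CIFAR10 setup's "150-225 tenthing" schedule multiplies the learning rate by 0.1 at epochs 150 and 225 of the 300-epoch run; under D-Adaptation the schedule scales the adapted base rate rather than a hand-tuned one. A sketch, assuming DAdaptSGD exposes a momentum argument as in the repository (the model is a placeholder for the CIFAR10 network):

import torch
from dadaptation import DAdaptSGD

model = torch.nn.Linear(10, 2)   # placeholder; the paper trains a CIFAR10 CNN
optimizer = DAdaptSGD(model.parameters(), lr=1.0, momentum=0.9)
# "150-225 tenthing": drop the (adapted) rate by 10x at epochs 150 and 225.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 225], gamma=0.1)

for epoch in range(300):
    # ... one training epoch over CIFAR10 batches ...
    scheduler.step()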