Learning-Rate-Free Learning by D-Adaptation
Authors: Aaron Defazio, Konstantin Mishchenko
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. |
| Researcher Affiliation | Collaboration | 1Meta AI, Fundamental AI Research (FAIR) team, New York 2CNRS, ENS, INRIA SIERRA team, France. |
| Pseudocode | Yes | Algorithm 1 Dual Averaging with D-Adaptation; Algorithm 2 Gradient Descent with D-Adaptation; Algorithm 3 D-Adapted AdaGrad; Algorithm 4 SGD with D-Adaptation; Algorithm 5 Adam with D-Adaptation (a simplified sketch of the core update appears after this table) |
| Open Source Code | Yes | An open-source implementation is available at https://github.com/facebookresearch/dadaptation (a usage sketch follows the table). |
| Open Datasets | Yes | For our convex experiments, we considered logistic regression applied to 12 commonly used benchmark problems from the LIBSVM repository. ... We used the three most common datasets used for optimization method testing: CIFAR10, CIFAR100 (Krizhevsky, 2009) and ImageNet 2012 (Russakovsky et al., 2015). ... The IWSLT14 German-to-English dataset (Cettolo et al., 2014) is a standard choice for benchmarking machine translation models. ... We train on the Book-Wiki corpus (combining books from Zhu et al. (2015) and a snapshot of Wikipedia). ... The COCO 2017 object detection task is a popular benchmark in computer vision. ... The fastMRI Knee Dataset (Zbontar et al., 2018) is a large-scale release of raw MRI data. ... The Criteo Kaggle Display Advertising dataset is a large, sparse dataset of user clickthrough events. |
| Dataset Splits | Yes | The learning rate for Adam was chosen as the value that gave the highest final accuracy using a grid search. ... Unless otherwise mentioned, we used the standard learning rate schedule typically used for the problem, with the base learning rate set by D-Adaptation. |
| Hardware Specification | Yes | Table 3. CIFAR10 experiment. GPUs 1 V100. ... Table 5. ImageNet experiment. GPUs 8 V100. |
| Software Dependencies | No | The paper mentions software such as the 'PyTorch Image Models framework', 'Detectron2', and 'Fairseq', but does not provide specific version numbers for these dependencies (e.g., 'PyTorch 1.9'). |
| Experiment Setup | Yes | Full hyper-parameter settings for each problem are included in the Appendix. ... Table 3. CIFAR10 experiment. Hyper-parameter Value: Epochs 300, Batch size per GPU 128, LR schedule 150-225 tenthing, Momentum 0.9, SGD LR 0.1. |
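
The pseudocode row above lists the paper's Algorithms 1-5, which share one mechanism: the step size is scaled by a running lower bound d on the unknown initial distance D = ||x0 - x*||, using the bound d >= (||s||^2 - sum_k lambda_k^2 ||g_k||^2) / (2||s||), where s is the weighted sum of past gradients. The NumPy sketch below is a simplified paraphrase of the gradient-descent variant (Algorithm 2), not the paper's exact algorithm: the lambda_k = d_k/||g_0|| normalization is recalled from the paper and should be checked against it, and all function and variable names are ours.

```python
# Simplified sketch of gradient descent with D-Adaptation (paraphrasing the
# paper's Algorithm 2). The d_hat estimate is the paper's provable lower
# bound on D = ||x0 - x*||; the 1/||g0|| step-size normalization is an
# assumption recalled from the paper and should be verified against it.
import numpy as np

def dadapt_gd(grad, x0, steps=1000, d0=1e-6):
    x, s = x0.copy(), np.zeros_like(x0)
    g0_norm = np.linalg.norm(grad(x0))   # assumes a nonzero initial gradient
    d, lam_sq_g_sq = d0, 0.0             # running sum of lambda_k^2 ||g_k||^2
    for _ in range(steps):
        g = grad(x)
        lam = d / g0_norm                # step size scales with the D estimate
        x = x - lam * g                  # plain gradient descent step
        s = s + lam * g                  # weighted gradient sum
        lam_sq_g_sq += (lam ** 2) * float(g @ g)
        s_norm = np.linalg.norm(s)
        if s_norm > 0:
            # Any valid D satisfies D >= (||s||^2 - sum lam^2 ||g||^2) / (2||s||),
            # so the estimate can only grow toward the true distance.
            d_hat = (s_norm ** 2 - lam_sq_g_sq) / (2 * s_norm)
            d = max(d, d_hat)
    return x

# Toy usage: minimize f(x) = 0.5 ||x - 1||^2 starting from the origin.
x_star = dadapt_gd(lambda x: x - 1.0, np.zeros(5))
print(np.round(x_star, 3))  # approaches the all-ones minimizer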
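```

Since the implementation is released, the following is a minimal drop-in usage sketch, loosely mirroring the Table 3 CIFAR10 setup quoted above (momentum 0.9, learning rate tenthing at epochs 150 and 225). The DAdaptSGD class and its lr/momentum arguments follow the repository; the stand-in model and training-loop scaffolding are assumptions for illustration.

```python
# Hedged usage sketch for the released dadaptation package
# (https://github.com/facebookresearch/dadaptation). The model and loop
# are placeholders; milestones follow the paper's Table 3 CIFAR10 setup.
import torch
import dadaptation

model = torch.nn.Linear(32, 10)  # stand-in for a real network

# Drop-in replacement for torch.optim.SGD: lr is left at 1.0 so the base
# step size is set entirely by the D-Adaptation estimate, while the usual
# schedule (tenthing at epochs 150 and 225) is layered on top.
optimizer = dadaptation.DAdaptSGD(model.parameters(), lr=1.0, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 225], gamma=0.1)

for epoch in range(300):
    # ... per-batch forward/backward passes and optimizer.step() go here ...
    scheduler.step()
```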