Prodigy: An Expeditiously Adaptive Parameter-Free Learner
Authors: Konstantin Mishchenko, Aaron Defazio
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show that our approach consistently outperforms D-Adaptation and reaches test accuracy values close to that of hand-tuned Adam. We test Prodigy on 12 common logistic-regression benchmark datasets, VGG11 and ResNet-50 training on CIFAR10, ViT training on Imagenet, LSTM training on IWSLT14, DLRM training on Criteo dataset, VarNet on Knee MRI dataset, as well as RoBERTa and GPT transformer training on BookWiki. |
| Researcher Affiliation | Industry | Samsung AI Center; Fundamental AI Research Team, Meta. |
| Pseudocode | Yes | Algorithm 1 Prodigy (GD version); a schematic sketch of this update loop is given below the table. |
| Open Source Code | Yes | Our open-source implementation is already widely used for fine-tuning of vision and language models, and is the recommended optimizer for Hugging Face Diffusers DreamBooth LoRA training. The PyTorch code of our optimizer is available at https://github.com/konstmish/prodigy (a minimal usage sketch is given below the table). |
| Open Datasets | Yes | We test Prodigy on 12 common logistic-regression benchmark datasets, VGG11 and ResNet-50 training on CIFAR10 (Krizhevsky, 2009), ViT training on Imagenet, LSTM training on IWSLT14, DLRM training on Criteo dataset, VarNet on Knee MRI dataset, as well as RoBERTa and GPT transformer training on BookWiki. |
| Dataset Splits | No | The paper mentions evaluating on a 'validation set' in the context of Neural Architecture Search (NAS) as a general example, but it does not specify the training, validation, or test dataset splits for its own experiments (e.g., percentages or counts) for datasets like CIFAR10, ImageNet, etc. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions PyTorch and the use of AdamW, but it does not provide specific version numbers for these or any other software dependencies needed to replicate the experiments. |
| Experiment Setup | Yes | For neural network experiments, we consider training on CIFAR10 (Krizhevsky, 2009) with batch size 256... We use cosine annealing with initial step size 1 for all Adam-based methods and initial step size 10⁻³ for Adam itself... For all methods, we use batch size 256, clip the gradients to have norm not exceeding 1 and use float16 numbers. We use AdamW with hyperparameters given in the repository, i.e., β2 = 0.99, weight decay 0.1, step size 10⁻³, cosine annealing with warmup over 100 steps. The same weight decay value and cosine annealing is used for Prodigy and D-Adapted Adam, except that the latter two methods use step size 1. We accumulate minibatches of size 12 into a batch of size 480. We tuned the weight decay for DoG and L-DoG and found the value 10⁻⁴ to work well for this problem. An illustrative PyTorch optimizer/scheduler configuration matching these settings is sketched below the table. |
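To make the pseudocode entry more concrete, below is a minimal NumPy sketch of the idea behind Prodigy's step-size estimation: take adaptively sized gradient steps while maintaining a provable lower bound on the initial distance to the solution, and enlarge the estimate `d` whenever that bound exceeds it. The AdaGrad-style normalization, the d-weighted accumulators, and all names here are simplifying assumptions of this sketch, not the paper's exact Algorithm 1; consult the paper or the repository for the precise update.

```python
import numpy as np

def prodigy_gd_sketch(grad, x0, n_steps=1000, d0=1e-6):
    """Prodigy-flavored distance estimation with an AdaGrad-type step (sketch only)."""
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    d = d0
    grad_norm_sq = 0.0        # running sum of ||g_i||^2 for the step-size denominator
    num = 0.0                 # running sum of lam_i * <g_i, x0 - x_i>
    s = np.zeros_like(x)      # running sum of lam_i * g_i
    for _ in range(n_steps):
        g = grad(x)
        grad_norm_sq += float(np.dot(g, g))
        eta = d / (np.sqrt(grad_norm_sq) + 1e-12)   # AdaGrad-norm-style step size
        lam = d * eta                               # extra factor d emphasizes later steps
        num += lam * float(np.dot(g, x0 - x))
        s += lam * g
        x = x - eta * g                             # gradient step with estimated size
        denom = np.linalg.norm(s)
        if denom > 0:
            # num / denom lower-bounds ||x0 - x*|| for convex f, for any
            # nonnegative weights lam_i; the estimate d only ever increases.
            d = max(d, num / denom)
    return x

# Example: minimize f(x) = 0.5 * ||x - 5||^2 with no hand-tuned learning rate.
x_final = prodigy_gd_sketch(lambda x: x - 5.0, np.zeros(3), n_steps=1000)
```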
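As a companion to the open-source code entry, here is a minimal usage sketch of the released PyTorch optimizer. It assumes the PyPI package name `prodigyopt` and the constructor arguments shown (with `lr` acting as a multiplier on the internally estimated step size); check the repository README at https://github.com/konstmish/prodigy for the authoritative interface.

```python
# pip install prodigyopt   (assumed package name; see the repository README)
import torch
from prodigyopt import Prodigy

model = torch.nn.Linear(10, 2)  # stand-in model for illustration

# lr=1.0 is the usual setting: Prodigy estimates the step size itself and
# lr only rescales that estimate; weight_decay mirrors the paper's experiments.
optimizer = Prodigy(model.parameters(), lr=1.0, weight_decay=0.1)

inputs = torch.randn(32, 10)
targets = torch.randint(0, 2, (32,))
for _ in range(10):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    optimizer.step()
```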
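The experiment-setup quote translates naturally into a PyTorch optimizer/scheduler configuration. The sketch below mirrors only the stated AdamW settings (β2 = 0.99, weight decay 0.1, step size 10⁻³, cosine annealing with a 100-step warmup, gradient clipping to norm 1); β1, the total step count, and the toy model are assumptions, and float16 training and gradient accumulation are omitted.

```python
import torch

model = torch.nn.Linear(128, 128)  # stand-in for the actual network

# AdamW as quoted: beta2 = 0.99, weight decay 0.1, step size 1e-3.
# beta1 = 0.9 is an assumption (the quote does not state it).
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, betas=(0.9, 0.99), weight_decay=0.1
)

# Cosine annealing preceded by a 100-step warmup; total_steps is an assumed horizon.
total_steps = 10_000
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=100)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps - 100)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[100]
)

data = torch.randn(64, 128)
for step in range(3):  # abbreviated training loop
    optimizer.zero_grad()
    loss = (model(data) - data).pow(2).mean()
    loss.backward()
    # Clip gradients to have norm not exceeding 1, as described in the setup.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```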