Prodigy: An Expeditiously Adaptive Parameter-Free Learner
Authors: Konstantin Mishchenko, Aaron Defazio
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show that our approach consistently outperforms D-Adaptation and reaches test accuracy values close to that of hand-tuned Adam. We test Prodigy on 12 common logistic-regression benchmark datasets, VGG11 and ResNet-50 training on CIFAR10, ViT training on Imagenet, LSTM training on IWSLT14, DLRM training on Criteo dataset, VarNet on Knee MRI dataset, as well as RoBERTa and GPT transformer training on BookWiki. |
| Researcher Affiliation | Industry | Samsung AI Center; Fundamental AI Research Team, Meta. |
| Pseudocode | Yes | Algorithm 1 Prodigy (GD version); a schematic sketch of this update loop is given below the table. |
| Open Source Code | Yes | Our open-source implementation is already widely used for fine-tuning of vision and language models, and is the recommended optimizer for Hugging Face Diffusers DreamBooth LoRA training. The PyTorch code of our optimizer is available at https://github.com/konstmish/prodigy (a minimal usage sketch is given below the table). |
| Open Datasets | Yes | We test Prodigy on 12 common logistic-regression benchmark datasets, VGG11 and ResNet-50 training on CIFAR10 (Krizhevsky, 2009), ViT training on Imagenet, LSTM training on IWSLT14, DLRM training on Criteo dataset, VarNet on Knee MRI dataset, as well as RoBERTa and GPT transformer training on BookWiki. |
| Dataset Splits | No | The paper mentions evaluating on a 'validation set' in the context of Neural Architecture Search (NAS) as a general example, but it does not specify the training, validation, or test dataset splits for its own experiments (e.g., percentages or counts) for datasets like CIFAR10, ImageNet, etc. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions PyTorch and the use of AdamW, but it does not provide specific version numbers for these or any other software dependencies needed to replicate the experiments. |
| Experiment Setup | Yes | For neural network experiments, we consider training on CIFAR10 (Krizhevsky, 2009) with batch size 256... We use cosine annealing with initial step size 1 for all Adam-based methods and initial step size 10⁻³ for Adam itself... For all methods, we use batch size 256, clip the gradients to have norm not exceeding 1 and use float16 numbers. We use AdamW with hyperparameters given in the repository, i.e., β2 = 0.99, weight decay 0.1, step size 10⁻³, cosine annealing with warmup over 100 steps. The same weight decay value and cosine annealing is used for Prodigy and D-Adapted Adam, except that the latter two methods use step size 1. We accumulate minibatches of size 12 into a batch of size 480. We tuned the weight decay for DoG and L-DoG and found the value 10⁻⁴ to work well for this problem. An illustrative PyTorch optimizer/scheduler configuration matching these settings is sketched below the table. |
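To make the pseudocode entry more concrete, below is a minimal NumPy sketch of the idea behind Prodigy's step-size estimation: take adaptively sized gradient steps while maintaining a provable lower bound on the initial distance to the solution, and enlarge the estimate `d` whenever that bound exceeds it. The AdaGrad-style normalization, the d-weighted accumulators, and all names here are simplifying assumptions of this sketch, not the paper's exact Algorithm 1; consult the paper or the repository for the precise update.

```python
import numpy as np

def prodigy_gd_sketch(grad, x0, n_steps=1000, d0=1e-6):
    """Prodigy-flavored distance estimation with an AdaGrad-type step (sketch only)."""
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    d = d0
    grad_norm_sq = 0.0        # running sum of ||g_i||^2 for the step-size denominator
    num = 0.0                 # running sum of lam_i * <g_i, x0 - x_i>
    s = np.zeros_like(x)      # running sum of lam_i * g_i
    for _ in range(n_steps):
        g = grad(x)
        grad_norm_sq += float(np.dot(g, g))
        eta = d / (np.sqrt(grad_norm_sq) + 1e-12)   # AdaGrad-norm-style step size
        lam = d * eta                               # extra factor d emphasizes later steps
        num += lam * float(np.dot(g, x0 - x))
        s += lam * g
        x = x - eta * g                             # gradient step with estimated size
        denom = np.linalg.norm(s)
        if denom > 0:
            # num / denom lower-bounds ||x0 - x*|| for convex f, for any
            # nonnegative weights lam_i; the estimate d only ever increases.
            d = max(d, num / denom)
    return x

# Example: minimize f(x) = 0.5 * ||x - 5||^2 with no hand-tuned learning rate.
x_final = prodigy_gd_sketch(lambda x: x - 5.0, np.zeros(3), n_steps=1000)
```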
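As a companion to the open-source code entry, here is a minimal usage sketch of the released PyTorch optimizer. It assumes the PyPI package name `prodigyopt` and the constructor arguments shown (with `lr` acting as a multiplier on the internally estimated step size); check the repository README at https://github.com/konstmish/prodigy for the authoritative interface.

```python
# pip install prodigyopt   (assumed package name; see the repository README)
import torch
from prodigyopt import Prodigy

model = torch.nn.Linear(10, 2)  # stand-in model for illustration

# lr=1.0 is the usual setting: Prodigy estimates the step size itself and
# lr only rescales that estimate; weight_decay mirrors the paper's experiments.
optimizer = Prodigy(model.parameters(), lr=1.0, weight_decay=0.1)

inputs = torch.randn(32, 10)
targets = torch.randint(0, 2, (32,))
for _ in range(10):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    optimizer.step()
```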
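The experiment-setup quote translates naturally into a PyTorch optimizer/scheduler configuration. The sketch below mirrors only the stated AdamW settings (β2 = 0.99, weight decay 0.1, step size 10⁻³, cosine annealing with a 100-step warmup, gradient clipping to norm 1); β1, the total step count, and the toy model are assumptions, and float16 training and gradient accumulation are omitted.

```python
import torch

model = torch.nn.Linear(128, 128)  # stand-in for the actual network

# AdamW as quoted: beta2 = 0.99, weight decay 0.1, step size 1e-3.
# beta1 = 0.9 is an assumption (the quote does not state it).
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, betas=(0.9, 0.99), weight_decay=0.1
)

# Cosine annealing preceded by a 100-step warmup; total_steps is an assumed horizon.
total_steps = 10_000
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=100)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps - 100)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[100]
)

data = torch.randn(64, 128)
for step in range(3):  # abbreviated training loop
    optimizer.zero_grad()
    loss = (model(data) - data).pow(2).mean()
    loss.backward()
    # Clip gradients to have norm not exceeding 1, as described in the setup.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```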