MADA: Meta-Adaptive Optimizers Through Hyper-Gradient Descent

Authors: Kaan Ozkara, Can Karakus, Parameswaran Raman, Mingyi Hong, Shoham Sabach, Branislav Kveton, Volkan Cevher

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically compare MADA to other popular optimizers on vision and language tasks, and find that MADA consistently outperforms Adam and other popular optimizers, and is robust against sub-optimally tuned hyper-parameters.
Researcher Affiliation | Collaboration | 1 Department of Electrical and Computer Engineering, University of California Los Angeles, USA; 2 Amazon Web Services; 3 Department of Electrical and Computer Engineering, University of Minnesota, USA; 4 Faculty of Data and Decision Sciences, Technion - Israel Institute of Technology, Israel; 5 LIONS, IEM, STI, École Polytechnique Fédérale de Lausanne, Switzerland.
Pseudocode | Yes | We provide pseudocode to illustrate this (see Algorithm 1: Pseudocode for a generic MADA). A hedged illustrative sketch of such a step is given in the first code example below the table.
Open Source Code | Yes | Our code is available at https://github.com/amazon-science/mada_optimizer_search
Open Datasets | Yes | We evaluate MADA on the causal language modeling task with GPT-2, over two datasets: Shakespeare (Karpathy, 2015), and OpenWebText (Gokaslan and Cohen, 2019).
Dataset Splits | No | The paper reports 'validation loss' and 'validation perplexities', but it does not state the percentages or sample counts used to split each dataset into training, validation, and test sets; it specifies training iterations and batch sizes, not the dataset partition.
Hardware Specification | Yes | We run our experiments on AWS p5.48xlarge instances equipped with 8 NVIDIA H100 GPUs.
Software Dependencies | No | The paper mentions using the 'PyTorch autograd machinery' and the 'nanoGPT code base' but does not specify version numbers for these or other software components.
Experiment Setup | Yes | On OpenWebText, we use a global batch size of 480 sequences, a cosine learning rate schedule with a peak learning rate of 6 × 10⁻⁴ (1.5 × 10⁻⁴ for Lion) and a final learning rate of 1.5 × 10⁻⁵... For OpenWebText experiments, we use established parameters for Adam (β1 = 0.9, β2 = 0.95, ϵ = 10⁻⁶) and also use these values as the initial parameters for MADA, Hyper-Adam, and AVGrad. A configuration sketch is given in the second code example below the table.
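
Algorithm 1 is not reproduced in this summary, so the following is a minimal illustrative sketch, not the authors' implementation, of a MADA-style step: a single learned coefficient rho interpolates between Adam's bias-corrected second moment (rho = 0) and an AMSGrad-style max-tracked second moment (rho = 1), and rho is updated by hyper-gradient descent through the previous update using PyTorch autograd, as the paper's quoted text suggests. The toy least-squares problem, variable names, and step sizes are assumptions made for illustration.

# Sketch only: one interpolation coefficient, a toy problem, and assumed step sizes.
import torch

torch.manual_seed(0)

# Toy least-squares problem: L(w) = mean((X w - y)^2)   (assumed for illustration)
X, y = torch.randn(64, 10), torch.randn(64)
w = torch.zeros(10, requires_grad=True)          # model parameters
rho = torch.tensor(0.5, requires_grad=True)      # learned optimizer coefficient

lr, hyper_lr = 1e-2, 1e-2                        # inner and hyper step sizes (assumed)
beta1, beta2, eps = 0.9, 0.95, 1e-6
m = torch.zeros_like(w)
v = torch.zeros_like(w)
v_max = torch.zeros_like(w)
prev_update = None                               # previous update, kept with its graph to rho

for t in range(1, 201):
    loss = ((X @ w - y) ** 2).mean()
    g = torch.autograd.grad(loss, w)[0]          # gradient at the current iterate

    # Hyper-gradient (one-step approximation): dL(w_t)/drho ~= -g_t . d(update_{t-1})/drho
    if prev_update is not None:
        hypergrad = torch.autograd.grad(prev_update, rho, grad_outputs=-g)[0]
        with torch.no_grad():
            rho -= hyper_lr * hypergrad
            rho.clamp_(0.0, 1.0)                 # keep the coefficient in its valid range

    # Moment bookkeeping (detached; gradients flow to rho only through the update below)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    v_max = torch.maximum(v_max, v_hat)

    v_tilde = rho * v_max + (1 - rho) * v_hat    # rho blends AMSGrad-style and Adam-style moments
    update = lr * m_hat / (v_tilde.sqrt() + eps)

    with torch.no_grad():
        w -= update                              # apply the update; w stays a leaf tensor
    prev_update = update                         # retain the graph to rho for the next hyper step

In the paper, Algorithm 1 applies this mechanism to a larger set of coefficients that parameterize a space of optimizers; the single-coefficient case above is only meant to show the mechanics of the hyper-gradient update.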
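
For the quoted OpenWebText setup, here is a minimal configuration sketch. The placeholder model, the total iteration count, and the omission of warmup are assumptions not stated in the excerpt, and torch.optim.Adam stands in for whichever optimizer is under test.

import math
import torch

# Placeholder model; the actual experiments train GPT-2 (assumption for this sketch).
model = torch.nn.Linear(768, 768)

peak_lr, final_lr = 6e-4, 1.5e-5                 # quoted peak / final learning rates
max_iters = 100_000                              # assumed total iterations (not stated here)

# Quoted Adam settings: beta1 = 0.9, beta2 = 0.95, eps = 1e-6.
optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr,
                             betas=(0.9, 0.95), eps=1e-6)

def cosine_lr(it: int) -> float:
    # Cosine decay from peak_lr at step 0 to final_lr at max_iters (warmup omitted).
    progress = min(it / max_iters, 1.0)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + coeff * (peak_lr - final_lr)

# Inside the training loop, the learning rate would be set each step, e.g.:
for group in optimizer.param_groups:
    group["lr"] = cosine_lr(0)                   # 6e-4 at the start of training
print(cosine_lr(0), cosine_lr(max_iters // 2), cosine_lr(max_iters))

Each optimizer step would then process a global batch of 480 sequences, per the quoted setup.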