MADA: Meta-Adaptive Optimizers Through Hyper-Gradient Descent
Authors: Kaan Ozkara, Can Karakus, Parameswaran Raman, Mingyi Hong, Shoham Sabach, Branislav Kveton, Volkan Cevher
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically compare MADA to other popular optimizers on vision and language tasks, and find that MADA consistently outperforms Adam and other popular optimizers, and is robust against sub-optimally tuned hyper-parameters. |
| Researcher Affiliation | Collaboration | 1 Department of Electrical and Computer Engineering, University of California, Los Angeles, USA; 2 Amazon Web Services; 3 Department of Electrical and Computer Engineering, University of Minnesota, USA; 4 Faculty of Data and Decision Sciences, Technion - Israel Institute of Technology, Israel; 5 LIONS, IEM, STI, École Polytechnique Fédérale de Lausanne, Switzerland. |
| Pseudocode | Yes | We provide a pseudo code to illustrate this (see Algorithm 1). Algorithm 1: Pseudocode for a generic MADA. (An illustrative hyper-gradient sketch follows this table.) |
| Open Source Code | Yes | Our code is available at https://github.com/amazon-science/mada_optimizer_search. |
| Open Datasets | Yes | We evaluate MADA on the causal language modeling task with GPT-2, over two datasets: Shakespeare (Karpathy, 2015), and OpenWebText (Gokaslan and Cohen, 2019). |
| Dataset Splits | No | The paper reports 'validation loss' and 'validation perplexities' for the trained models, but it does not give percentages or sample counts for how each dataset was split into training, validation, and test sets; it specifies training iterations and batch sizes, not the overall dataset partition. |
| Hardware Specification | Yes | We run our experiments on AWS p5.48xlarge instances equipped with 8 NVIDIA H100 GPUs. |
| Software Dependencies | No | The paper mentions using the 'PyTorch autograd machinery' and the 'nanoGPT' codebase but does not specify version numbers for these or other software components. |
| Experiment Setup | Yes | On OpenWebText, we use a global batch size of 480 sequences, cosine learning rate schedule with the peak learning rate of 6 × 10⁻⁴ (1.5 × 10⁻⁴ for Lion) and the final learning rate of 1.5 × 10⁻⁵... For OpenWebText experiments, we use established parameters for Adam (β₁ = 0.9, β₂ = 0.95, ϵ = 10⁻⁶) and also use these values as the initial parameters for MADA, HyperAdam, and AVGrad. (A hedged sketch of this schedule and optimizer configuration follows the table.) |
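
The pseudocode row refers to the paper's Algorithm 1 (a generic MADA). As a rough illustration of the underlying idea of learning optimizer coefficients by hyper-gradient descent, here is a minimal NumPy sketch; it is not the authors' algorithm. It blends a momentum direction and an Adam-style direction with a single learnable coefficient `beta`, updated with a one-step hyper-gradient. The toy quadratic objective, `hyper_lr`, and all variable names are assumptions made for this example only.

```python
# Minimal illustrative sketch (NOT the paper's Algorithm 1): hyper-gradient
# descent on a single interpolation coefficient `beta` that blends two update
# directions. The quadratic objective, hyper_lr, and all names are assumptions.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
A = A.T @ A + np.eye(10)           # positive-definite quadratic objective
b = rng.standard_normal(10)

def grad(w):
    return A @ w - b               # gradient of 0.5 * w^T A w - b^T w

w = np.zeros(10)
m = np.zeros(10)                   # first-moment (momentum) buffer
v = np.full(10, 1e-8)              # second-moment buffer (Adam-style)
beta = 0.5                         # learnable interpolation coefficient
lr, hyper_lr = 1e-2, 1e-3
dw_dbeta = np.zeros(10)            # d(w_t)/d(beta) carried from the previous step

for step in range(500):
    g = grad(w)
    # One-step hyper-gradient: dL/dbeta ≈ g_t^T * d(w_t)/d(beta),
    # using the sensitivity recorded at the previous iteration.
    beta = float(np.clip(beta - hyper_lr * (g @ dw_dbeta), 0.0, 1.0))

    m = 0.9 * m + 0.1 * g
    v = 0.99 * v + 0.01 * g * g
    momentum_dir = m                        # heavy-ball style direction
    adam_dir = m / (np.sqrt(v) + 1e-8)      # adaptive direction
    # w_{t+1} = w_t - lr * ((1 - beta) * momentum_dir + beta * adam_dir)
    dw_dbeta = -lr * (adam_dir - momentum_dir)
    w = w - lr * ((1 - beta) * momentum_dir + beta * adam_dir)

print("final loss:", 0.5 * w @ A @ w - b @ w, "learned beta:", round(beta, 3))
```

MADA itself parameterizes a richer space of optimizers than this two-way blend; the sketch only shows how a coefficient of an update rule can be adjusted with a hyper-gradient computed from the current gradient and the previous step's sensitivity.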
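
For the experiment-setup row, here is a small sketch of the quoted OpenWebText configuration: a cosine learning-rate decay from the peak of 6e-4 to the final 1.5e-5, plus the reported Adam parameters (β₁ = 0.9, β₂ = 0.95, ϵ = 10⁻⁶). The iteration budget `max_iters` and the absence of a warmup phase are assumptions for illustration, not values stated in the quote.

```python
# Hedged sketch of the reported OpenWebText schedule, assuming a plain cosine
# decay from the peak to the final learning rate over `max_iters` iterations.
# `max_iters` is a placeholder and no warmup is modeled (not stated in the quote).
import math

peak_lr = 6e-4            # reported peak learning rate (1.5e-4 for Lion)
final_lr = 1.5e-5         # reported final learning rate
global_batch_size = 480   # sequences, as reported
max_iters = 50_000        # hypothetical iteration budget for illustration

def cosine_lr(it: int) -> float:
    """Cosine-decayed learning rate at iteration `it`."""
    progress = min(it, max_iters) / max_iters
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0
    return final_lr + coeff * (peak_lr - final_lr)

# Adam-style hyper-parameters quoted for OpenWebText, also used as the
# initial values for MADA, HyperAdam, and AVGrad in the paper's setup.
adam_kwargs = dict(betas=(0.9, 0.95), eps=1e-6)

if __name__ == "__main__":
    for it in (0, max_iters // 2, max_iters):
        print(f"iter {it:>6}: lr = {cosine_lr(it):.2e}")
```

The schedule starts at the peak rate, reaches the midpoint of the two rates halfway through training, and ends at the final rate, which matches the peak/final values quoted in the table.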