MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence
Authors: Ionut-Vlad Modoranu, Mher Safaryan, Grigory Malinovsky, Eldar Kurtić, Thomas Robert, Peter Richtarik, Dan Alistarh
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We complement our algorithmic and analytic results with an efficient GPU implementation of MICROADAM, which we validate for fine-tuning language models from the BERT [Devlin et al., 2018], OPT [Zhang et al., 2022] and LLaMA [Touvron et al., 2023] families, with hundreds of millions to billions of parameters. We now validate our optimizer experimentally. We focus on comparing MICROADAM with Adam, Adam-8bit, GaLore and CAME in the context of LLM fine-tuning on different tasks, and with SGD, Adam and AdamW-8bit in the context of ResNets on ImageNet. |
| Researcher Affiliation | Academia | Institute of Science and Technology Austria (ISTA); King Abdullah University of Science and Technology (KAUST) |
| Pseudocode | Yes | We provide pseudocode in Algorithm 1 and highlight the parts related to error feedback quantization in blue. (A generic error-feedback illustration is sketched below the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/IST-DASLab/MicroAdam. |
| Open Datasets | Yes | BERT [Devlin et al., 2018], OPT [Zhang et al., 2022] and LLaMA [Touvron et al., 2023] families, GLUE/MNLI, the GSM8k math reasoning dataset, the Open-Platypus instruction tuning dataset, as well as pre-training ResNet models on ImageNet. |
| Dataset Splits | No | For the 7B model, our results show that MICROADAM can allow accurate full fine-tuning of a 7B model on this task using a single 40GB GPU. Moreover, MICROADAM preserves accuracy relative to Adam, with lower memory usage than the well-optimized implementation of 8bit AdamW, and marginally lower running time for the shorter gradient window m = 10. We integrated our optimizer with the llm-foundry repository of MosaicML and tested via lm-evaluation-harness. |
| Hardware Specification | Yes | We run our experiments on NVIDIA A100-SXM4-80GB, H100-80GB and RTX 3090 (24GB) GPUs, in a single-GPU setup. |
| Software Dependencies | No | The paper mentions software like PyTorch, CUDA, Hugging Face Transformers, llm-foundry, lm-evaluation-harness, and FFCV, but does not provide specific version numbers for any of these components. |
| Experiment Setup | Yes | We provide full details regarding training settings and hyper-parameters in Appendix B. All Adam variants use default parameters β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸, and the regularization parameter λ is 0 for fine-tuning and 3e-4 for ImageNet pre-training. MICROADAM uses a window size of m = 10 gradients with k = 1% density (equivalent to 99% sparsity), and the quantization bucket size is set to 64 for the error feedback. For GLUE/MNLI, we used the learning rate grid {1e-6, 3e-6, 5e-6, 7e-6, 1e-5, 3e-5, 5e-5, 7e-5} for all optimizers and models. (See the configuration sketch below the table.) |
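The "Pseudocode" row points to the paper's Algorithm 1, whose error-feedback quantization steps are highlighted in blue. The snippet below is **not** Algorithm 1; it is a minimal, generic sketch of the error-feedback pattern it builds on (re-inject the compression error into the next gradient, keep the top-k entries, and store the residual in quantized buckets). All function names, the 16-level quantizer, and the bucketing layout are assumptions, using only the k = 1% density and bucket size 64 quoted in the table.

```python
# Generic error feedback with top-k compression and bucketed quantization.
# Illustrative only; not the paper's Algorithm 1 or the MicroAdam implementation.
import torch

def quantize_bucketed(x: torch.Tensor, bucket_size: int = 64, levels: int = 16):
    """Uniformly quantize x in buckets of `bucket_size` values (assumed scheme)."""
    pad = (-x.numel()) % bucket_size
    flat = torch.cat([x.flatten(), x.new_zeros(pad)]).view(-1, bucket_size)
    lo = flat.min(dim=1, keepdim=True).values
    hi = flat.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp_min(1e-12) / (levels - 1)
    codes = torch.round((flat - lo) / scale)          # per-bucket integer codes
    deq = (codes * scale + lo).flatten()[: x.numel()]  # dequantized residual
    return deq.view_as(x)

def error_feedback_step(grad: torch.Tensor, error: torch.Tensor, density: float = 0.01):
    """One step: re-inject the stored error, keep top-k entries, requantize the rest."""
    acc = grad + error                                 # error feedback: add residual back
    k = max(1, int(density * acc.numel()))             # k = 1% density (99% sparsity)
    idx = acc.abs().flatten().topk(k).indices
    compressed_flat = torch.zeros_like(acc.flatten())
    compressed_flat[idx] = acc.flatten()[idx]
    compressed = compressed_flat.view_as(acc)
    new_error = quantize_bucketed(acc - compressed)    # store residual in low precision
    return compressed, new_error
```

The design point being illustrated is that only the sparse top-k part of the accumulated gradient is passed on to the optimizer state, while the rest is kept (in compressed form) so that no gradient information is permanently discarded.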
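The hyper-parameters quoted in the "Experiment Setup" row translate directly into a configuration. The sketch below assumes standard `torch.optim.AdamW` for the Adam baselines (treating λ as weight decay is an assumption of this sketch) and keeps the MicroAdam-specific values in a plain dict, because the optimizer's constructor signature is not quoted in this excerpt.

```python
# Hypothetical configuration sketch of the reported hyper-parameters; names in
# `microadam_config` are illustrative, not the library's API.
import torch

def make_adamw(params, lr, pretraining=False):
    # Adam defaults quoted in the paper: β1 = 0.9, β2 = 0.999, ϵ = 1e-8;
    # λ = 0 for fine-tuning, 3e-4 for ImageNet pre-training (assumed to be weight decay).
    return torch.optim.AdamW(
        params,
        lr=lr,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=3e-4 if pretraining else 0.0,
    )

microadam_config = {
    "m": 10,                  # gradient window size
    "k_density": 0.01,        # 1% density, i.e. 99% sparsity
    "quant_bucket_size": 64,  # error-feedback quantization bucket size
}

# Learning-rate grid used for GLUE/MNLI, swept for every optimizer and model.
LR_GRID = [1e-6, 3e-6, 5e-6, 7e-6, 1e-5, 3e-5, 5e-5, 7e-5]

if __name__ == "__main__":
    model = torch.nn.Linear(8, 2)  # stand-in model for the sketch
    for lr in LR_GRID:
        opt = make_adamw(model.parameters(), lr=lr)
        # ... fine-tune and record validation accuracy for this learning rate ...
```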