MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence

Authors: Ionut-Vlad Modoranu, Mher Safaryan, Grigory Malinovsky, Eldar Kurtić, Thomas Robert, Peter Richtárik, Dan Alistarh

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We complement our algorithmic and analytic results with an efficient GPU implementation of MicroAdam, which we validate for fine-tuning language models from the BERT [Devlin et al., 2018], OPT [Zhang et al., 2022] and LLaMA [Touvron et al., 2023] families, with hundreds of millions to billions of parameters. We now validate our optimizer experimentally. We focus on comparing MicroAdam with Adam, Adam-8bit, GaLore and CAME in the context of LLM fine-tuning on different tasks, and with SGD, Adam and AdamW-8bit in the context of ResNets on ImageNet.
Researcher Affiliation | Academia | 1 Institute of Science and Technology Austria (ISTA); 2 King Abdullah University of Science and Technology (KAUST)
Pseudocode | Yes | We provide pseudocode in Algorithm 1 and highlight the parts related to error feedback quantization in blue. (A hedged sketch of this compression step appears after the table.)
Open Source Code | Yes | Our code is available at https://github.com/IST-DASLab/MicroAdam.
Open Datasets | Yes | BERT [Devlin et al., 2018], OPT [Zhang et al., 2022] and LLaMA [Touvron et al., 2023] families, GLUE/MNLI, the GSM8k math reasoning dataset, the Open-Platypus instruction tuning dataset, as well as pre-training ResNet models on ImageNet.
Dataset Splits | No | For the 7B model, our results show that MicroAdam can allow accurate full fine-tuning of a 7B model on this task using a single 40GB GPU. Moreover, MicroAdam preserves accuracy relative to Adam, with lower memory usage than the well-optimized implementation of 8-bit AdamW, and marginally lower running time for the shorter gradient window m = 10. We integrated our optimizer with the llm-foundry repository of MosaicML and tested via lm-evaluation-harness. (A rough memory estimate for this setting appears after the table.)
Hardware Specification | Yes | We run our experiments on NVIDIA A100-SXM4-80GB and H100-80GB GPUs, and on an RTX 3090 with 24GB of memory, in a single-GPU setup.
Software Dependencies | No | The paper mentions software like PyTorch, CUDA, Hugging Face Transformers, llm-foundry, lm-evaluation-harness, and FFCV, but does not provide specific version numbers for any of these components.
Experiment Setup | Yes | We provide full details regarding training settings and hyper-parameters in Appendix B. All Adam variants use the default parameters β1 = 0.9, β2 = 0.999, ϵ = 1e-8, and the regularization parameter λ is 0 for fine-tuning and 3e-4 for ImageNet pre-training. MicroAdam uses a window size of m = 10 gradients with k = 1% density (equivalent to 99% sparsity), and the quantization bucket size is set to 64 for the error feedback. For GLUE/MNLI, we used the learning rate grid {1e-6, 3e-6, 5e-6, 7e-6, 1e-5, 3e-5, 5e-5, 7e-5} for all optimizers and models. (A configuration sketch with these values appears after the table.)
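
The Pseudocode row above refers to Algorithm 1, whose central step compresses each gradient with Top-K sparsification while the mass discarded by Top-K accumulates in a quantized error-feedback buffer. The following is a minimal Python sketch of that step only; the 4-bit width, the per-bucket affine quantizer, and the function names are illustrative assumptions, not the released implementation of Algorithm 1.

# Minimal sketch of Top-K compression with quantized error feedback
# (illustrative only; not the paper's exact Algorithm 1).
import torch

def quantize(x, bucket_size=64, bits=4):
    """Per-bucket affine uniform quantization; returns integer codes plus per-bucket offset/step."""
    levels = 2 ** bits - 1
    x = x.reshape(-1, bucket_size)          # assumes x.numel() is divisible by bucket_size
    lo = x.min(dim=1, keepdim=True).values
    step = (x.max(dim=1, keepdim=True).values - lo).clamp_min(1e-12) / levels
    codes = torch.round((x - lo) / step).to(torch.uint8)
    return codes, lo, step

def dequantize(codes, lo, step):
    return (codes.float() * step + lo).reshape(-1)

def compress_step(grad, error_state, k_density=0.01, bucket_size=64):
    """One step: add the decompressed error back, take Top-K, re-quantize the residual."""
    codes, lo, step = error_state
    acc = grad.reshape(-1) + dequantize(codes, lo, step)   # error-corrected gradient
    k = max(1, int(k_density * acc.numel()))
    idx = torch.topk(acc.abs(), k).indices                 # Top-K by magnitude
    vals = acc[idx]
    residual = acc.clone()
    residual[idx] = 0.0                                    # mass discarded by Top-K
    new_error_state = quantize(residual, bucket_size)      # keep the error in low precision
    return idx, vals, new_error_state

The caller would initialize the error state as quantize(torch.zeros(n)) and retain only the (idx, vals) pairs of the last m steps, from which MicroAdam estimates its Adam statistics; only the uint8 codes and per-bucket scales persist between steps.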
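
The Dataset Splits row quotes a claim that a 7B-parameter model can be fully fine-tuned on a single 40GB GPU. A back-of-the-envelope estimate of the optimizer-state footprint, using the m = 10 and k = 1% settings from the Experiment Setup row, makes the claim plausible; the byte widths below (bf16 sparse values, int32 indices, a 4-bit error buffer) are illustrative assumptions, not figures taken from the paper.

# Rough optimizer-state memory estimate for a 7B-parameter model
# (illustrative assumptions: bf16 sparse values, int32 indices, 4-bit error buffer).
n = 7e9                               # parameters
GB = 1024 ** 3

adam_states = 2 * 4 * n / GB          # two fp32 moment buffers: ~52 GB

m, k = 10, 0.01                       # gradient window and Top-K density
sparse_vals = m * k * n * 2 / GB      # bf16 values of the m sparse gradients: ~1.3 GB
sparse_idx = m * k * n * 4 / GB       # int32 indices for the same entries: ~2.6 GB
error_buf = n * 0.5 / GB              # 4-bit quantized error feedback: ~3.3 GB
microadam_states = sparse_vals + sparse_idx + error_buf

print(f"Adam states:      {adam_states:.1f} GB")
print(f"MicroAdam states: {microadam_states:.1f} GB")

Under these assumptions the Adam moments alone already exceed the 40GB card, while MicroAdam's compressed state of roughly 7 GB leaves room for bf16 weights and gradients (about 13 GB each) plus activations.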
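
Finally, the Experiment Setup row pins down the shared hyper-parameters. The sketch below expresses them as a PyTorch-style configuration: the Adam baseline uses the stock torch.optim.AdamW constructor, while the MicroAdam-specific values are kept in a plain dictionary with illustrative key names, since the released optimizer's actual argument names are not quoted here.

# Hyper-parameters quoted in the Experiment Setup row, expressed as a PyTorch-style config.
import torch

def make_adam_baseline(model, lr, finetuning=True):
    # Default Adam settings from the paper: beta1=0.9, beta2=0.999, eps=1e-8;
    # regularization (weight decay) 0 for fine-tuning, 3e-4 for ImageNet pre-training.
    return torch.optim.AdamW(
        model.parameters(),
        lr=lr,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=0.0 if finetuning else 3e-4,
    )

# Learning-rate grid used for GLUE/MNLI across all optimizers and models.
GLUE_MNLI_LR_GRID = [1e-6, 3e-6, 5e-6, 7e-6, 1e-5, 3e-5, 5e-5, 7e-5]

# MicroAdam-specific settings quoted in the paper; the key names are illustrative,
# not the released optimizer's actual argument names.
MICROADAM_CONFIG = {
    "m": 10,            # gradient window size
    "k_density": 0.01,  # Top-K density (99% sparsity)
    "bucket_size": 64,  # quantization bucket size for the error feedback
}

A grid search over GLUE/MNLI would then loop over GLUE_MNLI_LR_GRID, building one optimizer per learning rate.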