Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Adam-mini: Use Fewer Learning Rates To Gain More

Authors: Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik (Durk) Kingma, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 39M to 13B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on 2× A800-80GB GPUs, which saves 33% wall-clock time for pre-training.
Researcher Affiliation | Collaboration | Yushun Zhang (1,3), Congliang Chen (1,3), Ziniu Li (1,3), Tian Ding (2,3), Chenwei Wu (4), Diederik P. Kingma (5), Yinyu Ye (1,6), Zhi-Quan Luo (1,3), Ruoyu Sun (1,2,3). 1: The Chinese University of Hong Kong, Shenzhen, China; 2: Shenzhen International Center for Industrial and Applied Mathematics; 3: Shenzhen Research Institute of Big Data; 4: Duke University; 5: Anthropic; 6: Stanford University
Pseudocode | Yes | We then provide one simple way to find good learning rates and propose Adam-mini. We provide a simple illustration in Figure 2 and relegate the complete form later in Algorithm 2. We summarize our main contribution as follows. New optimizer. We propose a new optimizer called Adam-mini. First, Adam-mini partitions the model parameters based on the principle we established upon the Hessian structure. Then, it chooses a single learning rate for each block using the average of Adam's v in that block.
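To make the block-wise idea in the quoted contribution concrete, here is a minimal sketch of one optimizer step. It assumes a pre-computed partition of parameters into blocks and uses plain Python lists; the function name, state layout, and hyperparameter defaults are illustrative assumptions, not the paper's reference implementation (see Algorithm 2 in the paper and the linked repository for that).

```python
import math

def adam_mini_step(blocks, state, lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-8):
    """One sketch step of a block-wise Adam variant: per-parameter first
    moment m as in Adam, but a SINGLE second-moment scalar v per block,
    updated from the mean of the squared gradients in that block.
    `blocks` maps block name -> (param list, grad list); `state` holds
    the step counter plus m (lists) and v (scalars) per block."""
    state["t"] += 1
    t = state["t"]
    for name, (params, grads) in blocks.items():
        # Per-parameter first moment, exactly as in Adam.
        m = [beta1 * mi + (1 - beta1) * gi
             for mi, gi in zip(state["m"][name], grads)]
        # One scalar v per block: mean of squared gradients in the block.
        g2_mean = sum(gi * gi for gi in grads) / len(grads)
        v = beta2 * state["v"][name] + (1 - beta2) * g2_mean
        state["m"][name], state["v"][name] = m, v
        # Bias correction as in Adam; the step size is shared block-wide.
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = v / (1 - beta2 ** t)
        step = lr / (math.sqrt(v_hat) + eps)
        blocks[name] = ([p - step * mh for p, mh in zip(params, m_hat)], grads)
    return blocks
```

Because v is a scalar per block rather than a full tensor, the optimizer state is far smaller than AdamW's, which is the source of the memory and throughput savings the report quotes.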
Open Source Code | Yes | Our implementation of Adam-mini is available at https://github.com/zyushun/Adam-mini
Open Datasets | Yes | We pre-train LLMs including the GPT-2 series and the Llama series. We train these models on mainstream English corpora from scratch. In particular, we train the GPT-2 (Radford et al., 2019) series (125M to 1.5B) on Openwebtext (Gokaslan et al., 2019). We train the Llama series (20M to 13B) (Touvron et al., 2023) on C4 (Raffel et al., 2020).
Dataset Splits | Yes | We use the ultrafeedback dataset. ... We train an SFT model with 40% of the chosen data and train a reward model using the remaining 60%.
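The quoted 40/60 split can be sketched as follows. This is an illustrative reconstruction only: the function name is hypothetical, and shuffling with a fixed seed is an assumption, not something the paper specifies.

```python
import random

def split_chosen_data(examples, sft_frac=0.4, seed=0):
    """Sketch of the split described above: sft_frac of the chosen data
    goes to SFT training, the remainder to reward-model training.
    (Seeded shuffling is an assumption, not from the paper.)"""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(examples) * sft_frac)
    sft = [examples[i] for i in idx[:cut]]
    rm = [examples[i] for i in idx[cut:]]
    return sft, rm
```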
Hardware Specification | Yes | All LLM experiments are conducted on four NVIDIA A800-80GB GPUs and the rest are conducted on four V100 GPUs.
Software Dependencies | No | The paper mentions "PyTorch default partition" and "DGL implementation" and various codebases like "nanoGPT codebase", "Torchtitan codebase", "ReMax codebase", but does not provide specific version numbers for any software library or framework.
Experiment Setup | Yes | Unless mentioned otherwise, we choose the model configurations by their standard protocols. We choose the learning rates by the recommendation from open-source platforms if applicable. For instance, for the GPT2 series, we use the learning rates recommended by (Liu et al., 2023), which are reported to be optimal by grid search. Unless mentioned otherwise, Adam-mini, Adafactor, CAME, SM3, and LAMB use the same learning rate as the recommended ones of AdamW. If there is no public recommended learning rate for AdamW, we tune the learning rate for all optimizers within the same computational budget and report the best performance. For other hyperparameters, we follow the recommendation from open-source platforms or their default settings. For SM3 and Adafactor, we incorporate momentum with β1 = 0.9 to offer a fair comparison with other optimizers, and the rest of the hyperparameters are set as default. The detailed configurations are explained as follows. GPT2 pre-training. We use the nanoGPT codebase to train GPT2 sized 125M (small), 330M (medium), and 1.5B (XL) on Openwebtext. For all models, we use seq_len = 1024, batch size = 480, weight decay coefficient λ = 0.1, ϵ = 1e-8, β1 = 0.9, β2 = 0.95. We use a cosine-decay learning rate schedule with 2000 iterations of warm-up. For GPT2-small and medium, we use the peak learning rates recommended by (Liu et al., 2023), which are reported to be the optimal ones found by grid search. For GPT2-XL, we use the peak learning rate recommended by Levanter. The chosen peak learning rates are 6e-4, 3e-4, 1e-4 for GPT2-small, medium, XL, respectively. The minimal learning rate is chosen as 3e-5, 6e-5, 1e-5 for these models.
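The schedule described in the setup (linear warm-up for 2000 iterations, then cosine decay from the peak to the minimal learning rate) can be sketched as follows. The GPT2-small values (peak 6e-4, minimum 3e-5, warmup 2000) come from the quoted text; the total iteration count is an assumption for illustration.

```python
import math

def lr_schedule(step, peak_lr=6e-4, min_lr=3e-5, warmup=2000, total=100_000):
    """Sketch of the schedule described above: linear warm-up to peak_lr
    over `warmup` iterations, then cosine decay down to min_lr.
    (`total` iterations is an illustrative assumption, not from the paper.)"""
    if step < warmup:
        # Linear warm-up; reaches peak_lr at the last warm-up step.
        return peak_lr * (step + 1) / warmup
    # Cosine decay from peak_lr to min_lr over the remaining steps.
    progress = (step - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The same shape applies to GPT2-medium and GPT2-XL with their respective (peak, min) pairs of (3e-4, 6e-5) and (1e-4, 1e-5).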