Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Gradient Multi-Normalization for Efficient LLM Training
Authors: Meyer Scetbon, Chao Ma, Wenbo Gong, Ted Meeds
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate the empirical performance of applying SinkGD optimizer to LLM pretraining tasks. All experiments were performed on NVIDIA A100 GPUs. |
| Researcher Affiliation | Industry | Meyer Scetbon (Microsoft Research), Chao Ma (Microsoft Research), Wenbo Gong (Microsoft Research), Ted Meeds (Microsoft Research) |
| Pseudocode | Yes | Algorithm 1: MultiNorm( , L, g); Algorithm 2: Multi-Normalized GD (MNGD); Algorithm 3: SR-Sinkhorn( , L); Algorithm 4: Sinkhorn GD (SinkGD) |
| Open Source Code | No | Methods introduced in this paper will be integrated into an open-source repo. |
| Open Datasets | Yes | all trained on the C4 dataset Raffel et al. (2020) |
| Dataset Splits | Yes | We evaluate SinkGD on the memory-efficient LLaMA training benchmark proposed by Zhao et al. (2024a). This benchmark uses LLaMA-based architecture (Touvron et al., 2023) with RMSNorm and SwiGLU activations (Zhang & Sennrich, 2019; Gao et al., 2023). We consider models with 60M, 130M, 350M, and 1.3B parameters, all trained on the C4 dataset Raffel et al. (2020) using an effective token batch size of 130K tokens (total batch size 512, context length 256). Specifically, for both 130M and 350M, we use 128 batch size with 4 accumulations. For 60M and 1B, we uses 256 batch with 2 accumulation, and 32 per-device batch size with 2 accumulation and 8x A100s, respectively. Following the setup of Zhao et al. (2024a); Zhu et al. (2024), SinkGD is applied to all linear modules in both attention and MLP blocks with L = 5 iterations for the SR-Sinkhorn procedure. For all other modules, that are the embedding layer, the RMSnorm layers, and the last output layer, Adam optimizer Kingma & Ba (2015) is used. We use exactly the same cosine learning rate scheduler as in Zhao et al. (2024a), where 10% of total training steps is used for warm-up. |
| Hardware Specification | Yes | All experiments were performed on NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions various optimizers (Adam, SWAN, Apollo, Galore, Fira) and the LLaMA architecture. It also discusses the use of BF16 and FP32 precision. However, it does not list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | using an effective token batch size of 130K tokens (total batch size 512, context length 256). Specifically, for both 130M and 350M, we use 128 batch size with 4 accumulations. For 60M and 1B, we uses 256 batch with 2 accumulation, and 32 per-device batch size with 2 accumulation and 8x A100s, respectively. Following the setup of Zhao et al. (2024a); Zhu et al. (2024), SinkGD is applied to all linear modules in both attention and MLP blocks with L = 5 iterations for the SR-Sinkhorn procedure. For all other modules, that are the embedding layer, the RMSnorm layers, and the last output layer, Adam optimizer Kingma & Ba (2015) is used. We use exactly the same cosine learning rate scheduler as in Zhao et al. (2024a), where 10% of total training steps is used for warm-up. Note that, as in Zhao et al. (2024a); Zhu et al. (2024), we use a group-wise learning rate for our optimizer. The effective learning rate used for linear modules in the transformer blocks is of the form αη_t where η_t is global learning rate provided by the scheduler and α is fixed hyperparameter that we set to α = 0.05. For Adam, we use η_t as the learning rate. We also perform a grid search of learning rate for Adam over {0.01, 0.005, 0.001, 0.0005, 0.0001}, except for 1B model which we search over {0.001, 0.0007, 0.0005, 0.0003, 0.0001}. We do not perform any weight decay for all optimizers. |
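
The batch-size arithmetic and learning-rate rules quoted in the Experiment Setup row can be checked with a short sketch. This is an illustration only, not the authors' code: the function names, the `BATCH_CONFIGS` layout, the device count of 1 for the 60M, 130M, and 350M models (that is, treating the quoted batch as the per-step total), the linear warm-up shape, and the decay to `final_lr=0.0` are all assumptions. The scheduler actually used is the one from Zhao et al. (2024a), whose exact form is not reproduced in the excerpt.

```python
import math

CONTEXT_LENGTH = 256  # tokens per sequence, as quoted in the table

# (batch per step, gradient accumulation steps, devices), following the quoted setup.
# A device count of 1 is an assumption meaning "the quoted batch is already the total".
BATCH_CONFIGS = {
    "60M": (256, 2, 1),
    "130M": (128, 4, 1),
    "350M": (128, 4, 1),
    "1B": (32, 2, 8),  # 32 per device, 2 accumulations, 8x A100
}


def total_batch_size(batch, accumulation, devices):
    """Number of sequences contributing to one optimizer update."""
    return batch * accumulation * devices


def tokens_per_step(model):
    """Effective token batch size for one optimizer update."""
    return total_batch_size(*BATCH_CONFIGS[model]) * CONTEXT_LENGTH


def cosine_lr(step, total_steps, peak_lr, warmup_fraction=0.10, final_lr=0.0):
    """Linear warm-up over the first 10% of steps, then cosine decay (assumed shape)."""
    warmup_steps = max(1, int(warmup_fraction * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))


def group_learning_rates(eta_t, alpha=0.05):
    """Group-wise rates: alpha * eta_t for linear modules, eta_t for Adam-managed modules."""
    return {"linear_modules": alpha * eta_t, "other_modules": eta_t}


if __name__ == "__main__":
    for name in BATCH_CONFIGS:
        print(name, total_batch_size(*BATCH_CONFIGS[name]), tokens_per_step(name))
```

Under these assumptions every configuration reaches a total batch size of 512 sequences, and 512 × 256 = 131,072 tokens per update, which matches the quoted "130K" figure after rounding.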