Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MARS: Unleashing the Power of Variance Reduction for Training Large Models
Authors: Huizhuo Yuan, Yifeng Liu, Shuang Wu, Zhou Xun, Quanquan Gu
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluated MARS on GPT-2 fine-tuning tasks using the OpenWebText dataset. It demonstrates superior performance on GPT-2 large: AdamW requires 50 billion tokens to reach a validation loss of 2.58, whereas MARS only requires 28 billion tokens, and it achieves a final validation loss of 2.51. Furthermore, on the downstream task HellaSwag, MARS improved accuracy to 44.64%, outperforming AdamW's 41.70% after training on 50 billion tokens. The code is available at https://github.com/AGI-Arena/MARS. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, University of California, Los Angeles, California, USA (This work was done during Yifeng's internship at ByteDance Seed) 2ByteDance Seed, San Jose, California, USA 3ByteDance Seed, Beijing, China. Correspondence to: Quanquan Gu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 MARS. 1: input: x_0, β1, {γ_t}, {η_t}. 2: Set m_0 = 0 and x_1 = x_0. 3: for t = 1 to n do. 4: Sample ξ_t and let c_t = ∇f(x_t, ξ_t) + γ_t · (β1/(1 − β1)) · (∇f(x_t, ξ_t) − ∇f(x_{t−1}, ξ_t)). 5: if ‖c_t‖_2 > 1, then c̃_t = c_t/‖c_t‖_2, else c̃_t = c_t. 6: m_t = β1·m_{t−1} + (1 − β1)·c̃_t. 7: x_{t+1} = argmin_x { η_t⟨m_t, x⟩ + (1/2)‖x − x_t‖²_{H_t} } |
| Open Source Code | Yes | Empirically, we evaluated MARS on GPT-2 fine-tuning tasks using the OpenWebText dataset. It demonstrates superior performance on GPT-2 large: AdamW requires 50 billion tokens to reach a validation loss of 2.58, whereas MARS only requires 28 billion tokens, and it achieves a final validation loss of 2.51. Furthermore, on the downstream task HellaSwag, MARS improved accuracy to 44.64%, outperforming AdamW's 41.70% after training on 50 billion tokens. The code is available at https://github.com/AGI-Arena/MARS. |
| Open Datasets | Yes | All our experiments are done based on the nanoGPT (Karpathy, 2022) implementation of the GPT-2 (Radford et al., 2019) architecture, and on the OpenWebText (Gokaslan et al., 2019) dataset. |
| Dataset Splits | Yes | The training and validation sets contain approximately 9 billion and 4.4 million tokens, respectively, all preprocessed using the GPT-2 tokenizer. We conduct experiments on three scales of GPT-2 models: small (125M parameters), medium (355M parameters), and large (770M parameters). |
| Hardware Specification | Yes | We utilized 16 NVIDIA A100 GPUs for training the small models. For the medium and large models, training was conducted on 32 NVIDIA A100 GPUs and 32 NVIDIA H100 GPUs, respectively. |
| Software Dependencies | No | The paper mentions the "nanoGPT (Karpathy, 2022) implementation" and the "lm-evaluation-harness codebase (Gao et al., 2024)" but does not specify software versions for programming languages or libraries such as PyTorch that are critical for reproducibility. |
| Experiment Setup | Yes | Per the nanoGPT configurations, we disabled biases, applied GELU activations, and set the Dropout rate (Srivastava et al., 2014) to 0.0. We utilized 16 NVIDIA A100 GPUs for training the small models. For the medium and large models, training was conducted on 32 NVIDIA A100 GPUs and 32 NVIDIA H100 GPUs, respectively. Other hyper-parameters of training are listed in Appendix F. |
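The pseudocode row above (Algorithm 1) can be sketched in a few lines of Python. This is a minimal illustration only: it takes the special case H_t = I, so the proximal step in line 7 reduces to a plain gradient step x − η·m, whereas the paper uses an AdamW-style preconditioner H_t; the hyperparameter values in the signature are illustrative placeholders, not the paper's settings.

```python
import math

def mars_update(x, grad, grad_prev, m, beta1=0.95, gamma=0.025, lr=6e-4):
    """One MARS step (Algorithm 1) on flat parameter lists, with H_t = I.

    x         : current parameters x_t
    grad      : stochastic gradient ∇f(x_t, ξ_t)
    grad_prev : gradient at the previous iterate, ∇f(x_{t-1}, ξ_t)
    m         : momentum buffer m_{t-1}
    Hyperparameter defaults are illustrative, not taken from the paper.
    """
    # Line 4: variance-reduced gradient estimate c_t
    scale = gamma * (beta1 / (1.0 - beta1))
    c = [g + scale * (g - gp) for g, gp in zip(grad, grad_prev)]
    # Line 5: rescale c_t if its l2 norm exceeds 1
    norm = math.sqrt(sum(ci * ci for ci in c))
    c_tilde = [ci / norm for ci in c] if norm > 1.0 else c
    # Line 6: momentum update
    m_new = [beta1 * mi + (1.0 - beta1) * ci for mi, ci in zip(m, c_tilde)]
    # Line 7 with H_t = I: the argmin reduces to a gradient step
    x_new = [xi - lr * mi for xi, mi in zip(x, m_new)]
    return x_new, m_new
```

When two consecutive gradients agree (grad == grad_prev), the correction term vanishes and c_t falls back to the plain stochastic gradient, which is the intended behavior of the variance-reduction term.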