Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MARS: Unleashing the Power of Variance Reduction for Training Large Models
Authors: Huizhuo Yuan, Yifeng Liu, Shuang Wu, Zhou Xun, Quanquan Gu
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluated MARS on GPT-2 fine-tuning tasks using the OpenWebText dataset. It demonstrates superior performance on GPT-2 large: AdamW requires 50 billion tokens to reach a validation loss of 2.58, whereas MARS only requires 28 billion tokens, and it achieves a final validation loss of 2.51. Furthermore, on the downstream task HellaSwag, MARS improved accuracy to 44.64%, outperforming AdamW's 41.70% after training on 50 billion tokens. The code is available at https://github.com/AGI-Arena/MARS. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, University of California, Los Angeles, California, USA (This work was done during Yifeng's internship at ByteDance Seed) 2ByteDance Seed, San Jose, California, USA 3ByteDance Seed, Beijing, China. Correspondence to: Quanquan Gu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 MARS. 1: input: x_0, β1, {γ_t}, {η_t}. 2: Set m_0 = 0 and x_1 = x_0. 3: for t = 1 to n do. 4: Sample ξ_t and let c_t = ∇f(x_t, ξ_t) + γ_t · (β1/(1 − β1)) · (∇f(x_t, ξ_t) − ∇f(x_{t−1}, ξ_t)). 5: if ‖c_t‖_2 > 1, then c̃_t = c_t/‖c_t‖_2, else c̃_t = c_t. 6: m_t = β1·m_{t−1} + (1 − β1)·c̃_t. 7: x_{t+1} = argmin_x { η_t⟨m_t, x⟩ + (1/2)‖x − x_t‖²_{H_t} } |
| Open Source Code | Yes | Empirically, we evaluated MARS on GPT-2 fine-tuning tasks using the OpenWebText dataset. It demonstrates superior performance on GPT-2 large: AdamW requires 50 billion tokens to reach a validation loss of 2.58, whereas MARS only requires 28 billion tokens, and it achieves a final validation loss of 2.51. Furthermore, on the downstream task HellaSwag, MARS improved accuracy to 44.64%, outperforming AdamW's 41.70% after training on 50 billion tokens. The code is available at https://github.com/AGI-Arena/MARS. |
| Open Datasets | Yes | All our experiments are done based on the nanoGPT (Karpathy, 2022) implementation of the GPT-2 (Radford et al., 2019) architecture, and on the OpenWebText (Gokaslan et al., 2019) dataset. |
| Dataset Splits | Yes | The training and validation sets contain approximately 9 billion and 4.4 million tokens, respectively, all preprocessed using the GPT-2 tokenizer. We conduct experiments on three scales of GPT-2 models: small (125M parameters), medium (355M parameters), and large (770M parameters). |
| Hardware Specification | Yes | We utilized 16 NVIDIA A100 GPUs for training the small models. For the medium and large models, training was conducted on 32 NVIDIA A100 GPUs and 32 NVIDIA H100 GPUs, respectively. |
| Software Dependencies | No | The paper mentions the "nanoGPT (Karpathy, 2022) implementation" and the "lm-evaluation-harness codebase (Gao et al., 2024)" but does not specify software versions for programming languages or libraries such as PyTorch that are critical for reproducibility. |
| Experiment Setup | Yes | Per the nanoGPT configurations, we disabled biases, applied GELU activations, and set the Dropout rate (Srivastava et al., 2014) to 0.0. We utilized 16 NVIDIA A100 GPUs for training the small models. For the medium and large models, training was conducted on 32 NVIDIA A100 GPUs and 32 NVIDIA H100 GPUs, respectively. Other hyper-parameters of training are listed in Appendix F. |
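The pseudocode row above (Algorithm 1) can be sketched in a few lines of Python. This is a minimal illustration only: it takes the special case H_t = I, so the proximal step in line 7 reduces to a plain gradient step x − η·m, whereas the paper uses an AdamW-style preconditioner H_t; the hyperparameter values in the signature are illustrative placeholders, not the paper's settings.

```python
import math

def mars_update(x, grad, grad_prev, m, beta1=0.95, gamma=0.025, lr=6e-4):
    """One MARS step (Algorithm 1) on flat parameter lists, with H_t = I.

    x         : current parameters x_t
    grad      : stochastic gradient ∇f(x_t, ξ_t)
    grad_prev : gradient at the previous iterate, ∇f(x_{t-1}, ξ_t)
    m         : momentum buffer m_{t-1}
    Hyperparameter defaults are illustrative, not taken from the paper.
    """
    # Line 4: variance-reduced gradient estimate c_t
    scale = gamma * (beta1 / (1.0 - beta1))
    c = [g + scale * (g - gp) for g, gp in zip(grad, grad_prev)]
    # Line 5: rescale c_t if its l2 norm exceeds 1
    norm = math.sqrt(sum(ci * ci for ci in c))
    c_tilde = [ci / norm for ci in c] if norm > 1.0 else c
    # Line 6: momentum update
    m_new = [beta1 * mi + (1.0 - beta1) * ci for mi, ci in zip(m, c_tilde)]
    # Line 7 with H_t = I: the argmin reduces to a gradient step
    x_new = [xi - lr * mi for xi, mi in zip(x, m_new)]
    return x_new, m_new
```

When two consecutive gradients agree (grad == grad_prev), the correction term vanishes and c_t falls back to the plain stochastic gradient, which is the intended behavior of the variance-reduction term.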