Monarch: Expressive Structured Matrices for Efficient and Accurate Training

Authors: Tri Dao, Beidi Chen, Nimit S. Sohoni, Arjun Desai, Michael Poli, Jessica Grogan, Alexander Liu, Aniruddh Rao, Atri Rudra, Christopher Ré

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate that Monarch can achieve favorable accuracy-efficiency tradeoffs in several end-to-end sparse training applications: speeding up ViT and GPT-2 training on ImageNet classification and WikiText-103 language modeling by 2× with comparable model quality, and reducing the error on PDE solving and MRI reconstruction tasks by 40%. In sparse-to-dense training, with a simple technique called reverse sparsification, Monarch matrices serve as a useful intermediate representation to speed up GPT-2 pretraining on OpenWebText by 2× without quality drop. The same technique brings 23% faster BERT pretraining than even the very optimized implementation from Nvidia that set the MLPerf 1.1 record. In dense-to-sparse fine-tuning, as a proof of concept, our Monarch approximation algorithm speeds up BERT fine-tuning on GLUE by 1.7× with comparable accuracy.
Researcher Affiliation | Academia | 1) Stanford University, USA; 2) Carnegie Mellon University, USA; 3) University at Buffalo, SUNY, USA; 4) University of Michigan, USA.
Pseudocode | Yes | Algorithm 1: Projection on the set of Monarch matrices (a structural sketch of the Monarch factorization follows the table).
Open Source Code | Yes | Monarch code is available at https://github.com/HazyResearch/fly
Open Datasets | Yes | We use the popular vision benchmark, ImageNet (Deng et al., 2009). For language modeling, we evaluate GPT-2 (Radford et al., 2019) on WikiText-103 (Merity et al., 2016). On the large OpenWebText dataset (Gokaslan et al., 2019)... On the Wikipedia + BookCorpus datasets (Zhu et al., 2015)...
Dataset Splits | Yes | In Table 15, we see that the sparse model trained end-to-end does not perform as well as the dense model, as language-modeling performance on such large datasets tends to correlate strongly with the number of parameters (scaling law). Transitioning from sparse to dense for the last 20% of training performs as well as the dense model. Table 15 (GPT-2 pretraining, either sparse end-to-end or sparse-to-dense) reports Model, Val perplexity, and Speedup, e.g. GPT2-small at 18.3 validation perplexity.
Hardware Specification | Yes | We measure the wall-clock training time on V100 GPUs. We compare the total training time of BERT-large trained with Monarch reverse sparsification against conventional dense training on 8 A100-40GB GPUs (DGX A100).
Software Dependencies | No | The paper mentions software components such as the timm library, the Hugging Face transformers library, Nvidia's Megatron-LM repo, the LAMB optimizer, Apex's automatic mixed-precision (AMP) level O2, and the DeepSpeed ZeRO optimizer (stage 1), but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We adopt the hyperparameters (optimizer, learning rate, learning rate scheduler) from Yuan et al. (2021). Details are in Table 9. We report the hyperparameters used in Table 10 and Table 11. We use an effective batch size of 512 and use gradient accumulation to fit into available GPU memory. We follow the training procedure and hyperparameters of the reference implementation from Nvidia Deep Learning Examples.
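
For context on the techniques quoted above (the Monarch factorization behind Algorithm 1 and the reverse-sparsification trick used for sparse-to-dense training), the sketch below shows the structure in NumPy. It is a minimal illustration under assumed conventions: the exact permutation and block layout may differ from the paper's definition and from the HazyResearch/fly implementation, and `monarch_matvec` / `densify` are hypothetical names, not the library's API.

```python
import numpy as np


def monarch_matvec(L_blocks, R_blocks, x):
    """Compute y = M x for a Monarch-structured M on n = m*m dimensions.

    L_blocks and R_blocks have shape (m, m, m): m blocks of size m x m,
    i.e. two block-diagonal factors, interleaved with a fixed
    "reshape to m x m and transpose" permutation (assumed convention).
    Cost is ~2*m^3 = 2*n^1.5 multiply-adds instead of n^2 for a dense matrix.
    """
    m = L_blocks.shape[0]
    X = x.reshape(m, m)                       # view the input as an m x m grid
    X = np.einsum("bij,bj->bi", R_blocks, X)  # block b of R acts on row b
    X = X.T                                   # the fixed permutation
    X = np.einsum("bij,bj->bi", L_blocks, X)  # block b of L acts on permuted row b
    return X.T.reshape(-1)                    # undo the permutation and flatten


def densify(L_blocks, R_blocks):
    """Reverse sparsification, schematically: materialize the Monarch product
    as a dense n x n matrix by applying the structured matvec to identity columns."""
    n = L_blocks.shape[0] ** 2
    eye = np.eye(n)
    cols = [monarch_matvec(L_blocks, R_blocks, eye[:, j]) for j in range(n)]
    return np.stack(cols, axis=1)


# Toy check: the dense matrix and the structured matvec agree.
rng = np.random.default_rng(0)
m = 4
L = rng.standard_normal((m, m, m))
R = rng.standard_normal((m, m, m))
x = rng.standard_normal(m * m)
assert np.allclose(densify(L, R) @ x, monarch_matvec(L, R, x))
```

The point of the structure is the cost: the two block-diagonal factors hold 2m³ = 2n^1.5 parameters and a matvec takes O(n^1.5) work, versus n² for a dense layer; collapsing the product back into an ordinary dense weight is the hand-off that the paper's reverse-sparsification experiments exploit before finishing training densely.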