Monarch: Expressive Structured Matrices for Efficient and Accurate Training

Authors: Tri Dao, Beidi Chen, Nimit S. Sohoni, Arjun Desai, Michael Poli, Jessica Grogan, Alexander Liu, Aniruddh Rao, Atri Rudra, Christopher Ré

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate that Monarch can achieve favorable accuracy-efficiency tradeoffs in several end-to-end sparse training applications: speeding up ViT and GPT-2 training on ImageNet classification and WikiText-103 language modeling by 2× with comparable model quality, and reducing the error on PDE solving and MRI reconstruction tasks by 40%. In sparse-to-dense training, with a simple technique called reverse sparsification, Monarch matrices serve as a useful intermediate representation to speed up GPT-2 pretraining on OpenWebText by 2× without quality drop. The same technique brings 23% faster BERT pretraining than even the very optimized implementation from Nvidia that set the MLPerf 1.1 record. In dense-to-sparse fine-tuning, as a proof of concept, our Monarch approximation algorithm speeds up BERT fine-tuning on GLUE by 1.7× with comparable accuracy.
Researcher Affiliation | Academia | 1) Stanford University, USA; 2) Carnegie Mellon University, USA; 3) University at Buffalo, SUNY, USA; 4) University of Michigan, USA.
Pseudocode | Yes | Algorithm 1: Projection on the set of Monarch matrices (a structural sketch of the Monarch factorization follows the table).
Open Source Code | Yes | Monarch code is available at https://github.com/HazyResearch/fly
Open Datasets | Yes | We use the popular vision benchmark, ImageNet (Deng et al., 2009). For language modeling, we evaluate GPT-2 (Radford et al., 2019) on WikiText-103 (Merity et al., 2016). On the large OpenWebText dataset (Gokaslan et al., 2019)... On the Wikipedia + BookCorpus datasets (Zhu et al., 2015)...
Dataset Splits | Yes | In Table 15, we see that the sparse model trained end-to-end does not perform as well as the dense model, as language-modeling performance on such large datasets tends to correlate strongly with the number of parameters (scaling law). Transitioning from sparse to dense for the last 20% of training performs as well as the dense model. Table 15 (GPT-2 pretraining, either sparse end-to-end or sparse-to-dense) reports Model, Val perplexity, and Speedup, e.g. GPT2-small at 18.3 validation perplexity.
Hardware Specification | Yes | We measure the wall-clock training time on V100 GPUs. We compare the total training time of BERT-large trained with Monarch reverse sparsification against conventional dense training on 8 A100-40GB GPUs (DGX A100).
Software Dependencies | No | The paper mentions software components such as the timm library, the Hugging Face transformers library, Nvidia's Megatron-LM repo, the LAMB optimizer, Apex's automatic mixed-precision (AMP) level O2, and the DeepSpeed ZeRO optimizer (stage 1), but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We adopt the hyperparameters (optimizer, learning rate, learning rate scheduler) from Yuan et al. (2021). Details are in Table 9. We report the hyperparameters used in Table 10 and Table 11. We use an effective batch size of 512 and use gradient accumulation to fit into available GPU memory. We follow the training procedure and hyperparameters of the reference implementation from Nvidia Deep Learning Examples.
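
For context on the techniques quoted above (the Monarch factorization behind Algorithm 1 and the reverse-sparsification trick used for sparse-to-dense training), the sketch below shows the structure in NumPy. It is a minimal illustration under assumed conventions: the exact permutation and block layout may differ from the paper's definition and from the HazyResearch/fly implementation, and `monarch_matvec` / `densify` are hypothetical names, not the library's API.

```python
import numpy as np


def monarch_matvec(L_blocks, R_blocks, x):
    """Compute y = M x for a Monarch-structured M on n = m*m dimensions.

    L_blocks and R_blocks have shape (m, m, m): m blocks of size m x m,
    i.e. two block-diagonal factors, interleaved with a fixed
    "reshape to m x m and transpose" permutation (assumed convention).
    Cost is ~2*m^3 = 2*n^1.5 multiply-adds instead of n^2 for a dense matrix.
    """
    m = L_blocks.shape[0]
    X = x.reshape(m, m)                       # view the input as an m x m grid
    X = np.einsum("bij,bj->bi", R_blocks, X)  # block b of R acts on row b
    X = X.T                                   # the fixed permutation
    X = np.einsum("bij,bj->bi", L_blocks, X)  # block b of L acts on permuted row b
    return X.T.reshape(-1)                    # undo the permutation and flatten


def densify(L_blocks, R_blocks):
    """Reverse sparsification, schematically: materialize the Monarch product
    as a dense n x n matrix by applying the structured matvec to identity columns."""
    n = L_blocks.shape[0] ** 2
    eye = np.eye(n)
    cols = [monarch_matvec(L_blocks, R_blocks, eye[:, j]) for j in range(n)]
    return np.stack(cols, axis=1)


# Toy check: the dense matrix and the structured matvec agree.
rng = np.random.default_rng(0)
m = 4
L = rng.standard_normal((m, m, m))
R = rng.standard_normal((m, m, m))
x = rng.standard_normal(m * m)
assert np.allclose(densify(L, R) @ x, monarch_matvec(L, R, x))
```

The point of the structure is the cost: the two block-diagonal factors hold 2m³ = 2n^1.5 parameters and a matvec takes O(n^1.5) work, versus n² for a dense layer; collapsing the product back into an ordinary dense weight is the hand-off that the paper's reverse-sparsification experiments exploit before finishing training densely.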