Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Monarch: Expressive Structured Matrices for Efficient and Accurate Training
Authors: Tri Dao, Beidi Chen, Nimit S. Sohoni, Arjun Desai, Michael Poli, Jessica Grogan, Alexander Liu, Aniruddh Rao, Atri Rudra, Christopher Ré
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate that Monarch can achieve favorable accuracy-efficiency tradeoffs in several end-to-end sparse training applications: speeding up ViT and GPT-2 training on ImageNet classification and WikiText-103 language modeling by 2× with comparable model quality, and reducing the error on PDE solving and MRI reconstruction tasks by 40%. In sparse-to-dense training, with a simple technique called reverse sparsification, Monarch matrices serve as a useful intermediate representation to speed up GPT-2 pretraining on OpenWebText by 2× without quality drop. The same technique brings 23% faster BERT pretraining than even the very optimized implementation from Nvidia that set the MLPerf 1.1 record. In dense-to-sparse fine-tuning, as a proof-of-concept, our Monarch approximation algorithm speeds up BERT fine-tuning on GLUE by 1.7× with comparable accuracy. |
| Researcher Affiliation | Academia | 1Stanford University, USA 2Carnegie Mellon University, USA 3University at Buffalo, SUNY, USA 4University of Michigan, USA. |
| Pseudocode | Yes | Algorithm 1 Projection on the set of Monarch matrices |
| Open Source Code | Yes | Monarch code is available at https://github.com/HazyResearch/fly |
| Open Datasets | Yes | We use the popular vision benchmark, ImageNet (Deng et al., 2009). For language modeling, we evaluate GPT-2 (Radford et al., 2019) on WikiText-103 (Merity et al., 2016). On the large OpenWebText dataset (Gokaslan et al., 2019)...On the Wikipedia + BookCorpus datasets (Zhu et al., 2015)... |
| Dataset Splits | Yes | In Table 15, we see that the sparse model trained end-to-end does not perform as well as the dense model, as language modeling performance on such large datasets tends to correlate strongly with number of parameters (scaling law). Transitioning from sparse to dense for the last 20% of training performs as well as the dense model. (Table 15: GPT-2 pretraining, with either sparse end-to-end or with sparse-to-dense training; reports validation perplexity and speedup.) |
| Hardware Specification | Yes | We measure the wall-clock training time on V100 GPUs. The total training time of BERT-large trained with Monarch reverse sparsification and with conventional dense training on 8 A100-40GB GPUs (DGX A100). |
| Software Dependencies | No | The paper mentions software components like the 'timm library', 'Huggingface transformers library', 'Nvidia's Megatron-LM repo', 'LAMB optimizer', 'Apex's automatic mixed-precision (AMP) level O2', and 'DeepSpeed ZeRO optimizer stage 1', but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We adopt the hyperparameters (optimizer, learning rate, learning rate scheduler) from Yuan et al. (2021). Details are in Table 9. We report the hyperparameters used in Table 10 and Table 11. We use an effective batch size of 512, and use gradient accumulation to fit into available GPU memory. We follow the training procedure and hyperparameters of the reference implementation from Nvidia Deep Learning examples. |
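For context on the structure that the pseudocode row ("Projection on the set of Monarch matrices") operates over: per the paper, a Monarch matrix factors as M = P L P R, where L and R are block-diagonal and P is a fixed "transpose" permutation. The NumPy sketch below is illustrative only, not the released repo's API (function and variable names are ours); it applies the structured factorization to a vector and checks it against a dense materialization.

```python
import numpy as np

def monarch_matvec(L_blocks, R_blocks, x):
    # Compute M @ x for M = P L P R, where L and R are block-diagonal
    # with b blocks of size b-by-b (so n = b * b), and P is the fixed
    # permutation that transposes a length-n vector viewed as a b-by-b grid.
    b = len(R_blocks)
    n = b * b
    y = np.concatenate([R_blocks[i] @ x[i * b:(i + 1) * b] for i in range(b)])  # R
    y = y.reshape(b, b).T.reshape(n)                                            # P
    y = np.concatenate([L_blocks[i] @ y[i * b:(i + 1) * b] for i in range(b)])  # L
    return y.reshape(b, b).T.reshape(n)                                         # P (self-inverse here)

# Sanity check against a dense materialization of M.
b = 4
n = b * b
rng = np.random.default_rng(0)
L_blocks = [rng.standard_normal((b, b)) for _ in range(b)]
R_blocks = [rng.standard_normal((b, b)) for _ in range(b)]

P = np.zeros((n, n))
for i in range(b):
    for j in range(b):
        P[i * b + j, j * b + i] = 1.0  # (P x)[i*b + j] = x[j*b + i]

L = np.zeros((n, n))
R = np.zeros((n, n))
for i in range(b):
    L[i * b:(i + 1) * b, i * b:(i + 1) * b] = L_blocks[i]
    R[i * b:(i + 1) * b, i * b:(i + 1) * b] = R_blocks[i]

M = P @ L @ P @ R
x = rng.standard_normal(n)
assert np.allclose(monarch_matvec(L_blocks, R_blocks, x), M @ x)
```

The point of the factorization is that the structured matvec touches only the block-diagonal entries (O(n√n) parameters here instead of n²), which is what makes the paper's reported training speedups possible on GPU-friendly batched block matmuls.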