NxMTransformer: Semi-Structured Sparsification for Natural Language Understanding via ADMM
Authors: Connor Holmes, Minjia Zhang, Yuxiong He, Bo Wu
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply our approach to a wide range of NLP tasks, and our proposed method is able to achieve 1.7 points higher accuracy in GLUE score than current best practices. Moreover, we perform detailed analysis on our approach and shed light on how ADMM affects fine-tuning accuracy for downstream tasks. |
| Researcher Affiliation | Collaboration | Connor Holmes, Colorado School of Mines, Golden, CO 80401, cholmes@mines.edu; Minjia Zhang, Microsoft, Bellevue, WA 98004, minjiaz@microsoft.com; Yuxiong He, Microsoft, Bellevue, WA 98004, yuxhe@microsoft.com; Bo Wu, Colorado School of Mines, Golden, CO 80401, bwu@mines.edu |
| Pseudocode | No | The paper describes the optimization steps but does not provide structured pseudocode or an algorithm block (a hedged sketch of those steps follows the table). |
| Open Source Code | No | The paper states 'NxMTransformer is implemented as a PyTorch [22] compatible library for sparsifying models with NxM semi-structured sparsities. Furthermore, a Hugging Face Transformers [33] compatible Trainer is implemented to enable easy integration with their model collection and training scripts.' However, it does not provide an explicit statement of code release or a link to a repository for the described methodology. |
| Open Datasets | Yes | Dataset. We evaluate NxMTransformer and our baselines using the General Language Understanding Evaluation (GLUE) benchmark [31], a collection of NLP tasks varying in data availability and complexity. [31] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353-355, Brussels, Belgium, November 2018. Association for Computational Linguistics. |
| Dataset Splits | Yes | For all configurations, we set fine-tuning to 5 epochs, and the best observed result on the validation set is reported. We report the evaluation results for BERT in Table 1 and make the following key observations. First, the pruning-based method sparsifies weights of Transformer blocks but cannot explicitly satisfy the underlying hardware constraints, e.g., the 4:2 sparsity. Although preserving the highest accuracy on downstream tasks (81.3 vs. 81.8 on average), the obtained sparse weights have a random structure of non-zero weights, which is inefficient to execute on modern hardware systems. As a result, the performance benefit of these unstructured-sparsity-based approaches is negligible, even when the pruning rate is high (e.g., 95%) [32]. Table 1: The dev set results on the GLUE benchmark. The results show that NxMTransformer is able to achieve higher accuracy than ASP for NxM sparsity, especially when the downstream tasks have low data resources. (A hedged sketch of loading these GLUE dev splits follows the table.) |
| Hardware Specification | Yes | All models were fine-tuned on an Intel Xeon 2630 v4 server with 2x NVIDIA Titan V running Ubuntu 18.04. |
| Software Dependencies | Yes | PyTorch version 1.7.1 was used alongside Transformers 4.3.2. |
| Experiment Setup | Yes | For the set of training hyperparameters used for training NxMTransformer, see Table 3. We fine-tune BERT for 5 epochs on each downstream task. We perform a grid search over batch sizes 16 and 32 and learning rates 1e-5, 3e-5, and 5e-5 for SST-2, QNLI, and MNLI, due to their high training cost. Learning rates of 7e-5 and 9e-5 are additionally used for the remaining tasks. For masked fine-tuning, the model was fine-tuned with learning rates 1e-5, 3e-5, 5e-5, 7e-5, and 9e-5 across batch sizes 16 and 32. ADMM-Unstructured is trained using the same hyperparameter sweeps as NxMTransformer. For all configurations, we set fine-tuning to 5 epochs, and the best observed result on the validation set is reported. (A sketch of this sweep follows the table.) |
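
The Pseudocode row notes that the paper describes its optimization steps only in prose. Below is a minimal, hedged sketch of those steps as described there: the task loss is augmented with a quadratic penalty rho/2 * ||W - Z + U||^2, Z is the projection of W + U onto the N:M sparse set (e.g., keeping the 2 largest-magnitude weights in every group of 4), and the dual variable U accumulates the residual W - Z. The helper names (`project_nm`, `admm_penalty`) and all hyperparameter values are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch (not the authors' code): ADMM-style fine-tuning toward N:M
# semi-structured sparsity for a single PyTorch linear layer, where each group
# of `n` consecutive input weights may keep at most `m` non-zeros (e.g., 4:2).
import torch


def project_nm(weight: torch.Tensor, n: int = 4, m: int = 2) -> torch.Tensor:
    """Project a 2-D weight onto the N:M sparse set: in every group of `n`
    consecutive weights along the input dimension, keep the `m` largest-magnitude
    entries and zero the rest."""
    out_features, in_features = weight.shape
    assert in_features % n == 0, "input dimension must be divisible by the group size"
    groups = weight.reshape(out_features, in_features // n, n)
    # Zero the (n - m) smallest-magnitude entries in each group.
    _, drop_idx = torch.topk(groups.abs(), k=n - m, dim=-1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop_idx, 0.0)
    return (groups * mask).reshape(out_features, in_features)


def admm_penalty(weight: torch.Tensor, z: torch.Tensor, u: torch.Tensor, rho: float) -> torch.Tensor:
    """Augmented-Lagrangian term rho/2 * ||W - Z + U||^2 added to the task loss."""
    return 0.5 * rho * torch.sum((weight - z + u) ** 2)


# One illustrative ADMM round: the task loss plus the penalty is minimized by
# SGD (the W-step), then Z is the projection of W + U onto the N:M set (the
# Z-step), and the dual variable U accumulates the residual.
if __name__ == "__main__":
    torch.manual_seed(0)
    layer = torch.nn.Linear(8, 4)
    z = project_nm(layer.weight.detach().clone())
    u = torch.zeros_like(z)
    opt = torch.optim.SGD(layer.parameters(), lr=1e-2)

    x, target = torch.randn(16, 8), torch.randn(16, 4)
    for _ in range(10):  # W-step: a few SGD iterations on loss + penalty
        loss = torch.nn.functional.mse_loss(layer(x), target)
        loss = loss + admm_penalty(layer.weight, z, u, rho=1e-2)
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():  # Z-step and dual update
        z = project_nm(layer.weight + u)
        u = u + layer.weight - z
```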
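
The Open Datasets and Dataset Splits rows refer to the GLUE benchmark and its dev (validation) sets. The paper does not say how the data is loaded; one common way, assumed here purely for illustration, is the Hugging Face `datasets` library:

```python
# Hedged sketch: obtaining GLUE tasks and their dev (validation) splits with the
# Hugging Face `datasets` library. This is a tooling assumption; the paper only
# states that results are reported on the GLUE dev sets.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")  # splits: train / validation / test
mnli = load_dataset("glue", "mnli")  # splits include validation_matched / validation_mismatched

print(sst2["validation"].num_rows)           # dev set used for reporting results
print(mnli["validation_matched"].num_rows)
```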
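
The Experiment Setup row describes a grid search over batch sizes and learning rates with 5 fine-tuning epochs, reporting the best dev-set result. The sketch below only enumerates that grid; `run_finetune` is a hypothetical stand-in for the actual fine-tuning entry point, which the paper does not name.

```python
# Hedged sketch of the hyperparameter sweep quoted in the Experiment Setup row.
# `run_finetune` is a hypothetical callable: (task, lr, batch size, epochs) -> dev metric.
from itertools import product
from typing import Callable

LARGE_TASKS = {"sst2", "qnli", "mnli"}  # swept with three learning rates due to training cost
BASE_LRS = [1e-5, 3e-5, 5e-5]
EXTRA_LRS = [7e-5, 9e-5]                # additionally used for the remaining tasks
BATCH_SIZES = [16, 32]
EPOCHS = 5


def sweep(task: str, run_finetune: Callable[..., float]):
    """Return (best dev metric, learning rate, batch size) over the described grid."""
    lrs = BASE_LRS if task in LARGE_TASKS else BASE_LRS + EXTRA_LRS
    best = None
    for lr, bs in product(lrs, BATCH_SIZES):
        score = run_finetune(task, learning_rate=lr, batch_size=bs, epochs=EPOCHS)
        if best is None or score > best[0]:
            best = (score, lr, bs)
    return best


if __name__ == "__main__":
    import random
    # Dummy objective so the sketch runs end to end; a real sweep would call the
    # actual fine-tuning routine and read back the best observed dev result.
    dummy = lambda task, learning_rate, batch_size, epochs: random.random()
    print(sweep("rte", dummy))
```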