NxMTransformer: Semi-Structured Sparsification for Natural Language Understanding via ADMM
Authors: Connor Holmes, Minjia Zhang, Yuxiong He, Bo Wu
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply our approach to a wide range of NLP tasks, and our proposed method is able to achieve 1.7 points higher accuracy in GLUE score than current best practices. Moreover, we perform detailed analysis on our approach and shed light on how ADMM affects fine-tuning accuracy for downstream tasks. |
| Researcher Affiliation | Collaboration | Connor Holmes, Colorado School of Mines, Golden, CO 80401, cholmes@mines.edu; Minjia Zhang, Microsoft, Bellevue, WA 98004, minjiaz@microsoft.com; Yuxiong He, Microsoft, Bellevue, WA 98004, yuxhe@microsoft.com; Bo Wu, Colorado School of Mines, Golden, CO 80401, bwu@mines.edu |
| Pseudocode | No | The paper describes the optimization steps but does not provide structured pseudocode or an algorithm block (a hedged sketch of those steps follows the table). |
| Open Source Code | No | The paper states 'NxMTransformer is implemented as a PyTorch [22] compatible library for sparsifying models with NxM semi-structured sparsities. Furthermore, a Hugging Face Transformers [33] compatible Trainer is implemented to enable easy integration with their model collection and training scripts.' However, it does not provide an explicit statement of code release or a link to a repository for the described methodology. |
| Open Datasets | Yes | Dataset. We evaluate NxMTransformer and our baselines using the General Language Understanding Evaluation (GLUE) benchmark [31], a collection of NLP tasks varying in data availability and complexity. [31] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353-355, Brussels, Belgium, November 2018. Association for Computational Linguistics. |
| Dataset Splits | Yes | For all configurations, we set fine-tuning to 5 epochs, and the best observed result on the validation set is reported. We report the evaluation results for BERT in Table 1 and make the following key observations. First, the pruning-based method sparsifies weights of Transformer blocks but cannot explicitly satisfy the underlying hardware constraints, e.g., the 4:2 sparsity. Although preserving the highest accuracy on downstream tasks (81.3 vs. 81.8 on average), the obtained sparse weights have a random structure of non-zero weights, which is inefficient to execute on modern hardware systems. As a result, the performance benefit of these unstructured-sparsity-based approaches is negligible, even when the pruning rate is high (e.g., 95%) [32]. Table 1: The dev set results on the GLUE benchmark. The results show that NxMTransformer is able to achieve higher accuracy than ASP for NxM sparsity, especially when the downstream tasks have low data resources. (A hedged sketch of loading these GLUE dev splits follows the table.) |
| Hardware Specification | Yes | All models were fine-tuned on an Intel Xeon 2630 v4 server with 2x NVIDIA Titan V running Ubuntu 18.04. |
| Software Dependencies | Yes | PyTorch version 1.7.1 was used alongside Transformers 4.3.2. |
| Experiment Setup | Yes | For the set of training hyperparameters used for training NxMTransformer, see Table 3. We fine-tune BERT for 5 epochs on each downstream task. We perform a grid search over batch sizes 16 and 32 and learning rates 1e-5, 3e-5, and 5e-5 for SST-2, QNLI, and MNLI, due to their high training cost. Learning rates of 7e-5 and 9e-5 are additionally used for the remaining tasks. For masked fine-tuning, the model was fine-tuned with learning rates 1e-5, 3e-5, 5e-5, 7e-5, and 9e-5 across batch sizes 16 and 32. ADMM-Unstructured is trained using the same hyperparameter sweeps as NxMTransformer. For all configurations, we set fine-tuning to 5 epochs, and the best observed result on the validation set is reported. (A sketch of this sweep follows the table.) |
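
The Pseudocode row notes that the paper describes its optimization steps only in prose. Below is a minimal, hedged sketch of those steps as described there: the task loss is augmented with a quadratic penalty rho/2 * ||W - Z + U||^2, Z is the projection of W + U onto the N:M sparse set (e.g., keeping the 2 largest-magnitude weights in every group of 4), and the dual variable U accumulates the residual W - Z. The helper names (`project_nm`, `admm_penalty`) and all hyperparameter values are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch (not the authors' code): ADMM-style fine-tuning toward N:M
# semi-structured sparsity for a single PyTorch linear layer, where each group
# of `n` consecutive input weights may keep at most `m` non-zeros (e.g., 4:2).
import torch


def project_nm(weight: torch.Tensor, n: int = 4, m: int = 2) -> torch.Tensor:
    """Project a 2-D weight onto the N:M sparse set: in every group of `n`
    consecutive weights along the input dimension, keep the `m` largest-magnitude
    entries and zero the rest."""
    out_features, in_features = weight.shape
    assert in_features % n == 0, "input dimension must be divisible by the group size"
    groups = weight.reshape(out_features, in_features // n, n)
    # Zero the (n - m) smallest-magnitude entries in each group.
    _, drop_idx = torch.topk(groups.abs(), k=n - m, dim=-1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop_idx, 0.0)
    return (groups * mask).reshape(out_features, in_features)


def admm_penalty(weight: torch.Tensor, z: torch.Tensor, u: torch.Tensor, rho: float) -> torch.Tensor:
    """Augmented-Lagrangian term rho/2 * ||W - Z + U||^2 added to the task loss."""
    return 0.5 * rho * torch.sum((weight - z + u) ** 2)


# One illustrative ADMM round: the task loss plus the penalty is minimized by
# SGD (the W-step), then Z is the projection of W + U onto the N:M set (the
# Z-step), and the dual variable U accumulates the residual.
if __name__ == "__main__":
    torch.manual_seed(0)
    layer = torch.nn.Linear(8, 4)
    z = project_nm(layer.weight.detach().clone())
    u = torch.zeros_like(z)
    opt = torch.optim.SGD(layer.parameters(), lr=1e-2)

    x, target = torch.randn(16, 8), torch.randn(16, 4)
    for _ in range(10):  # W-step: a few SGD iterations on loss + penalty
        loss = torch.nn.functional.mse_loss(layer(x), target)
        loss = loss + admm_penalty(layer.weight, z, u, rho=1e-2)
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():  # Z-step and dual update
        z = project_nm(layer.weight + u)
        u = u + layer.weight - z
```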
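
The Open Datasets and Dataset Splits rows refer to the GLUE benchmark and its dev (validation) sets. The paper does not say how the data is loaded; one common way, assumed here purely for illustration, is the Hugging Face `datasets` library:

```python
# Hedged sketch: obtaining GLUE tasks and their dev (validation) splits with the
# Hugging Face `datasets` library. This is a tooling assumption; the paper only
# states that results are reported on the GLUE dev sets.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")  # splits: train / validation / test
mnli = load_dataset("glue", "mnli")  # splits include validation_matched / validation_mismatched

print(sst2["validation"].num_rows)           # dev set used for reporting results
print(mnli["validation_matched"].num_rows)
```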
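
The Experiment Setup row describes a grid search over batch sizes and learning rates with 5 fine-tuning epochs, reporting the best dev-set result. The sketch below only enumerates that grid; `run_finetune` is a hypothetical stand-in for the actual fine-tuning entry point, which the paper does not name.

```python
# Hedged sketch of the hyperparameter sweep quoted in the Experiment Setup row.
# `run_finetune` is a hypothetical callable: (task, lr, batch size, epochs) -> dev metric.
from itertools import product
from typing import Callable

LARGE_TASKS = {"sst2", "qnli", "mnli"}  # swept with three learning rates due to training cost
BASE_LRS = [1e-5, 3e-5, 5e-5]
EXTRA_LRS = [7e-5, 9e-5]                # additionally used for the remaining tasks
BATCH_SIZES = [16, 32]
EPOCHS = 5


def sweep(task: str, run_finetune: Callable[..., float]):
    """Return (best dev metric, learning rate, batch size) over the described grid."""
    lrs = BASE_LRS if task in LARGE_TASKS else BASE_LRS + EXTRA_LRS
    best = None
    for lr, bs in product(lrs, BATCH_SIZES):
        score = run_finetune(task, learning_rate=lr, batch_size=bs, epochs=EPOCHS)
        if best is None or score > best[0]:
            best = (score, lr, bs)
    return best


if __name__ == "__main__":
    import random
    # Dummy objective so the sketch runs end to end; a real sweep would call the
    # actual fine-tuning routine and read back the best observed dev result.
    dummy = lambda task, learning_rate, batch_size, epochs: random.random()
    print(sweep("rte", dummy))
```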