A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models
Authors: Yuanxin Liu, Fandong Meng, Zheng Lin, Jiangnan Li, Peng Fu, Yanan Cao, Weiping Wang, Jie Zhou
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To this end, we conduct extensive experiments with the pre-trained BERT model on three natural language understanding (NLU) tasks. Our results demonstrate that sparse and robust subnetworks (SRNets) can consistently be found in BERT, across the aforementioned three scenarios, using different training and compression methods. |
| Researcher Affiliation | Collaboration | Yuanxin Liu (1,2,3), Fandong Meng (5), Zheng Lin (1,4), Jiangnan Li (1,4), Peng Fu (1), Yanan Cao (1,4), Weiping Wang (1), Jie Zhou (5). 1: Institute of Information Engineering, Chinese Academy of Sciences; 2: MOE Key Laboratory of Computational Linguistics, Peking University; 3: School of Computer Science, Peking University; 4: School of Cyber Security, University of Chinese Academy of Sciences; 5: Pattern Recognition Center, WeChat AI, Tencent Inc., China |
| Pseudocode | No | The complete mask training algorithm is summarized in Appendix A.1.2. |
| Open Source Code | Yes | The code is available at https://github.com/llyx97/sparse-and-robust-PLM. |
| Open Datasets | Yes | Natural Language Inference: We use MNLI [44] as the ID dataset for NLI. ... To solve this problem, the OOD HANS dataset [30] is built so that such correlation does not hold. Paraphrase Identification: The ID dataset for paraphrase identification is QQP (https://www.kaggle.com/c/quora-question-pairs), which contains question pairs... The OOD datasets PAWS-qqp and PAWS-wiki [50] are built from sentences... Fact Verification: FEVER [40] (licence information at https://fever.ai/download/fever/license.html) is adopted as the ID dataset... |
| Dataset Splits | Yes | Unless otherwise specified, we select the best checkpoints based on the ID dev performance, without using OOD information. |
| Hardware Specification | No | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix B. |
| Software Dependencies | No | Mask training and IMP basically use the same hyper-parameters (adopting from [42]) as full BERT. An exception is longer training, because we find that good subnetworks at high sparsity levels require more training to be found. Unless otherwise specified, we select the best checkpoints based on the ID dev performance, without using OOD information. All the reported results are averaged over 4 runs. We defer training details about each dataset, and each training and pruning setup, to Appendix B.3. |
| Experiment Setup | No | Mask training and IMP basically use the same hyper-parameters (adopting from [42]) as full BERT. An exception is longer training, because we find that good subnetworks at high sparsity levels require more training to be found. Unless otherwise specified, we select the best checkpoints based on the ID dev performance, without using OOD information. All the reported results are averaged over 4 runs. We defer training details about each dataset, and each training and pruning setup, to Appendix B.3. |
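
The "Pseudocode" and "Experiment Setup" rows above refer to mask training as the main subnetwork-search method, but the paper defers the full algorithm to its Appendix A.1.2. As a rough illustration only (not the authors' implementation), the PyTorch sketch below wraps a frozen pre-trained linear layer with trainable mask scores that are binarized to a target sparsity in the forward pass and updated through a straight-through estimator; the `MaskedLinear` class, the magnitude-based score initialization, and the top-k binarization rule are assumptions chosen for brevity.

```python
# Minimal mask-training sketch (PyTorch). Assumption: the paper's exact recipe
# lives in its Appendix A.1.2; this follows the generic
# "frozen weights + trainable binary mask via straight-through estimator" idea.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Module):
    """A frozen pre-trained linear layer gated by a learned binary mask."""

    def __init__(self, linear: nn.Linear, sparsity: float = 0.9):
        super().__init__()
        # Pre-trained weights stay fixed; only the mask scores are trained.
        self.weight = nn.Parameter(linear.weight.detach().clone(), requires_grad=False)
        self.bias = nn.Parameter(linear.bias.detach().clone(), requires_grad=False)
        # Magnitude-based score initialization (an assumption, not the paper's choice).
        self.scores = nn.Parameter(self.weight.abs().clone())
        self.sparsity = sparsity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Keep the top-(1 - sparsity) fraction of scores, zero out the rest.
        k = max(1, int(self.scores.numel() * (1.0 - self.sparsity)))
        threshold = torch.topk(self.scores.flatten(), k).values.min()
        hard_mask = (self.scores >= threshold).float()
        # Straight-through estimator: binary mask in the forward pass,
        # identity gradient to the scores in the backward pass.
        mask = (hard_mask - self.scores).detach() + self.scores
        return F.linear(x, self.weight * mask, self.bias)
```

In use, one would replace each `nn.Linear` inside a Hugging Face BERT encoder with `MaskedLinear` and fine-tune only the `scores` parameters on the ID training set (e.g. MNLI), selecting checkpoints by ID dev performance as the table notes.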
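
For the ID/OOD pairs listed in the "Open Datasets" row, most corpora are also mirrored on the Hugging Face datasets hub. The configuration names below are assumptions about those hub copies rather than the authors' data pipeline; PAWS-qqp (which has to be generated from the QQP release) and the paper's OOD fact-verification set are omitted.

```python
# Hedged sketch: loading the public ID/OOD datasets with the `datasets` library.
from datasets import load_dataset

mnli = load_dataset("glue", "mnli")                # ID: natural language inference
hans = load_dataset("hans")                        # OOD: HANS
qqp = load_dataset("glue", "qqp")                  # ID: paraphrase identification
paws_wiki = load_dataset("paws", "labeled_final")  # OOD: PAWS-wiki
fever = load_dataset("fever", "v1.0")              # ID: fact verification

print(mnli)  # prints the available splits and their sizes
```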