A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models
Authors: Yuanxin Liu, Fandong Meng, Zheng Lin, Jiangnan Li, Peng Fu, Yanan Cao, Weiping Wang, Jie Zhou
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To this end, we conduct extensive experiments with the pre-trained BERT model on three natural language understanding (NLU) tasks. Our results demonstrate that sparse and robust subnetworks (SRNets) can consistently be found in BERT, across the aforementioned three scenarios, using different training and compression methods. |
| Researcher Affiliation | Collaboration | Yuanxin Liu (1,2,3), Fandong Meng (5), Zheng Lin (1,4), Jiangnan Li (1,4), Peng Fu (1), Yanan Cao (1,4), Weiping Wang (1), Jie Zhou (5). 1: Institute of Information Engineering, Chinese Academy of Sciences; 2: MOE Key Laboratory of Computational Linguistics, Peking University; 3: School of Computer Science, Peking University; 4: School of Cyber Security, University of Chinese Academy of Sciences; 5: Pattern Recognition Center, WeChat AI, Tencent Inc., China |
| Pseudocode | No | The complete mask training algorithm is summarized in Appendix A.1.2. |
| Open Source Code | Yes | The code is available at https://github.com/llyx97/sparse-and-robust-PLM. |
| Open Datasets | Yes | Natural Language Inference: We use MNLI [44] as the ID dataset for NLI. ... To solve this problem, the OOD HANS dataset [30] is built so that such correlation does not hold. Paraphrase Identification: The ID dataset for paraphrase identification is QQP (https://www.kaggle.com/c/quora-question-pairs), which contains question pairs... The OOD datasets PAWS-qqp and PAWS-wiki [50] are built from sentences... Fact Verification: FEVER [40] (licence information at https://fever.ai/download/fever/license.html) is adopted as the ID dataset... |
| Dataset Splits | Yes | Unless otherwise specified, we select the best checkpoints based on the ID dev performance, without using OOD information. |
| Hardware Specification | No | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix B. |
| Software Dependencies | No | Mask training and IMP basically use the same hyper-parameters (adopting from [42]) as full BERT. An exception is longer training, because we find that good subnetworks at high sparsity levels require more training to be found. Unless otherwise specified, we select the best checkpoints based on the ID dev performance, without using OOD information. All the reported results are averaged over 4 runs. We defer training details about each dataset, and each training and pruning setup, to Appendix B.3. |
| Experiment Setup | No | Mask training and IMP basically use the same hyper-parameters (adopting from [42]) as full BERT. An exception is longer training, because we find that good subnetworks at high sparsity levels require more training to be found. Unless otherwise specified, we select the best checkpoints based on the ID dev performance, without using OOD information. All the reported results are averaged over 4 runs. We defer training details about each dataset, and each training and pruning setup, to Appendix B.3. |
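
The "Pseudocode" and "Experiment Setup" rows above refer to mask training as the main subnetwork-search method, but the paper defers the full algorithm to its Appendix A.1.2. As a rough illustration only (not the authors' implementation), the PyTorch sketch below wraps a frozen pre-trained linear layer with trainable mask scores that are binarized to a target sparsity in the forward pass and updated through a straight-through estimator; the `MaskedLinear` class, the magnitude-based score initialization, and the top-k binarization rule are assumptions chosen for brevity.

```python
# Minimal mask-training sketch (PyTorch). Assumption: the paper's exact recipe
# lives in its Appendix A.1.2; this follows the generic
# "frozen weights + trainable binary mask via straight-through estimator" idea.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Module):
    """A frozen pre-trained linear layer gated by a learned binary mask."""

    def __init__(self, linear: nn.Linear, sparsity: float = 0.9):
        super().__init__()
        # Pre-trained weights stay fixed; only the mask scores are trained.
        self.weight = nn.Parameter(linear.weight.detach().clone(), requires_grad=False)
        self.bias = nn.Parameter(linear.bias.detach().clone(), requires_grad=False)
        # Magnitude-based score initialization (an assumption, not the paper's choice).
        self.scores = nn.Parameter(self.weight.abs().clone())
        self.sparsity = sparsity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Keep the top-(1 - sparsity) fraction of scores, zero out the rest.
        k = max(1, int(self.scores.numel() * (1.0 - self.sparsity)))
        threshold = torch.topk(self.scores.flatten(), k).values.min()
        hard_mask = (self.scores >= threshold).float()
        # Straight-through estimator: binary mask in the forward pass,
        # identity gradient to the scores in the backward pass.
        mask = (hard_mask - self.scores).detach() + self.scores
        return F.linear(x, self.weight * mask, self.bias)
```

In use, one would replace each `nn.Linear` inside a Hugging Face BERT encoder with `MaskedLinear` and fine-tune only the `scores` parameters on the ID training set (e.g. MNLI), selecting checkpoints by ID dev performance as the table notes.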
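
For the ID/OOD pairs listed in the "Open Datasets" row, most corpora are also mirrored on the Hugging Face datasets hub. The configuration names below are assumptions about those hub copies rather than the authors' data pipeline; PAWS-qqp (which has to be generated from the QQP release) and the paper's OOD fact-verification set are omitted.

```python
# Hedged sketch: loading the public ID/OOD datasets with the `datasets` library.
from datasets import load_dataset

mnli = load_dataset("glue", "mnli")                # ID: natural language inference
hans = load_dataset("hans")                        # OOD: HANS
qqp = load_dataset("glue", "qqp")                  # ID: paraphrase identification
paws_wiki = load_dataset("paws", "labeled_final")  # OOD: PAWS-wiki
fever = load_dataset("fever", "v1.0")              # ID: fact verification

print(mnli)  # prints the available splits and their sizes
```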