ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques
Authors: Yuanxin Liu, Zheng Lin, Fengcheng Yuan | Pages 8715-8722
AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we integrate three kinds of compression methods (weight pruning, low-rank factorization and knowledge distillation (KD)) and explore a range of designs concerning model architecture, KD strategy, pruning frequency and learning rate schedule. We find that a careful choice of the designs is crucial to the performance of the compressed model. Based on the empirical findings, our best compressed model, dubbed Refined BERT cOmpreSsion with InTegrAted techniques (ROSITA), is 7.5× smaller than BERT while maintaining 98.5% of the performance on five tasks of the GLUE benchmark, outperforming the previous BERT compression methods with similar parameter budget. (A hedged sketch of the low-rank factorization step appears below the table.) |
| Researcher Affiliation | Collaboration | Yuanxin Liu1,2, Zheng Lin1*, Fengcheng Yuan3; 1Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; 2School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; 3Meituan Inc. |
| Pseudocode | No | The paper includes 'Figure 2: Illustration of different pruning and KD settings.', which visually explains the pruning and KD procedures, but it is an illustration rather than formal pseudocode or an algorithm block. |
| Open Source Code | Yes | The code is available at https://github.com/llyx97/Rosita. |
| Open Datasets | Yes | The General Language Understanding Evaluation (GLUE) benchmark (Wang et al. 2019) is a collection of diverse tasks... The data statistics are shown in Table 1. |
| Dataset Splits | Yes | The data statistics are shown in Table 1. ... In our exploratory study, we evaluate the performance on the development sets. |
| Hardware Specification | Yes | The inference time is measured on a single 24GB TITAN RTX GPU over the original MNLI training set (the batch size and maximum sequence length are set to 128). |
| Software Dependencies | No | We fine-tune BERT-BASE and the compressed models on the five GLUE tasks using the Hugging Face transformers library. ... The models are trained with Adam optimizer (Kingma and Ba 2015). The paper names the software but does not specify version numbers, which limits reproducibility. |
| Experiment Setup | Yes | The training hyperparameters are tuned separately for each task. We select the value of learning rate from {1e-5, 2e-5, 3e-5, 5e-5} and the value of batch size from {32, 64}. The range of the number of epochs varies across different settings. Due to space limitations, please refer to the code link for detailed hyperparameter settings. (A hedged sketch of this tuning grid follows the table.) |
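
The compression recipe quoted in the Research Type row combines weight pruning, low-rank factorization and knowledge distillation. As a minimal, hedged illustration of the low-rank factorization step alone (not ROSITA's actual procedure), the sketch below replaces a single dense linear layer with two smaller ones via truncated SVD; the rank and the example layer size are illustrative assumptions, and the authors' real implementation lives at https://github.com/llyx97/Rosita.

```python
# Minimal sketch: low-rank factorization of one linear layer via truncated SVD.
# The rank (256) and the FFN-sized layer are illustrative assumptions, not the
# configuration used by ROSITA; see https://github.com/llyx97/Rosita for the
# authors' implementation, which also integrates pruning and KD.
import torch
import torch.nn as nn


def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate layer.weight with a rank-`rank` product of two smaller layers."""
    W = layer.weight.data                      # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]               # (out_features, rank)
    V_r = Vh[:rank, :]                         # (rank, in_features)

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data.copy_(V_r)
    second.weight.data.copy_(U_r)
    if layer.bias is not None:
        second.bias.data.copy_(layer.bias.data)
    return nn.Sequential(first, second)


if __name__ == "__main__":
    dense = nn.Linear(768, 3072)               # FFN-sized layer, as in BERT-base
    low_rank = factorize_linear(dense, rank=256)
    x = torch.randn(4, 768)
    # Reconstruction error of the factorized layer on a random input batch.
    print(torch.dist(dense(x), low_rank(x)))
```

Replacing the 768×3072 weight with a rank-256 factorization cuts that layer's parameters from roughly 2.4M to about 1.0M, which is the kind of size/accuracy trade-off the paper's exploratory study tunes jointly with pruning and KD.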
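
The Software Dependencies and Experiment Setup rows name the Hugging Face transformers library, the Adam optimizer and the search grid, but defer exact per-task values to the code link. Below is a hedged sketch of such a grid search for one GLUE task; the task (SST-2), checkpoint, epoch count and Trainer defaults (AdamW rather than plain Adam) are assumptions for illustration, not the paper's reported settings.

```python
# Hedged sketch of the hyperparameter grid quoted in the Experiment Setup row:
# learning rate in {1e-5, 2e-5, 3e-5, 5e-5}, batch size in {32, 64}.
# SST-2, bert-base-uncased and 3 epochs are placeholders; the Trainer's default
# AdamW optimizer stands in for the Adam optimizer cited in the paper.
import itertools

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("glue", "sst2")
encoded = dataset.map(lambda ex: tokenizer(ex["sentence"], truncation=True), batched=True)

for lr, bs in itertools.product([1e-5, 2e-5, 3e-5, 5e-5], [32, 64]):
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
    args = TrainingArguments(
        output_dir=f"runs/sst2_lr{lr}_bs{bs}",
        learning_rate=lr,
        per_device_train_batch_size=bs,
        num_train_epochs=3,                    # epoch range varies per task in the paper
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=encoded["train"],
        eval_dataset=encoded["validation"],    # the paper evaluates on the GLUE dev sets
        tokenizer=tokenizer,
    )
    trainer.train()
    print(f"lr={lr} bs={bs}", trainer.evaluate())
```

Each (learning rate, batch size) pair is scored on the development set, consistent with the Dataset Splits row's statement that the exploratory study reports dev-set performance.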