ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques
Authors: Yuanxin Liu, Zheng Lin, Fengcheng Yuan | Pages 8715-8722
AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we integrate three kinds of compression methods (weight pruning, low-rank factorization and knowledge distillation (KD)) and explore a range of designs concerning model architecture, KD strategy, pruning frequency and learning rate schedule. We find that a careful choice of the designs is crucial to the performance of the compressed model. Based on the empirical findings, our best compressed model, dubbed Refined BERT cOmpreSsion with InTegrAted techniques (ROSITA), is 7.5× smaller than BERT while maintaining 98.5% of the performance on five tasks of the GLUE benchmark, outperforming the previous BERT compression methods with similar parameter budget. (A hedged sketch of the low-rank factorization step appears below the table.) |
| Researcher Affiliation | Collaboration | Yuanxin Liu1,2, Zheng Lin1*, Fengcheng Yuan3; 1Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; 2School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; 3Meituan Inc. |
| Pseudocode | No | The paper includes 'Figure 2: Illustration of different pruning and KD settings.', which visually explains the pruning and KD procedures, but it is an illustration rather than formal pseudocode or an algorithm block. |
| Open Source Code | Yes | The code is available at https://github.com/llyx97/Rosita. |
| Open Datasets | Yes | The General Language Understanding Evaluation (GLUE) benchmark (Wang et al. 2019) is a collection of diverse tasks... The data statistics are shown in Table 1. |
| Dataset Splits | Yes | The data statistics are shown in Table 1. ... In our exploratory study, we evaluate the performance on the development sets. |
| Hardware Specification | Yes | The inference time is measured on a single 24GB TITAN RTX GPU over the original MNLI training set (the batch size and maximum sequence length are set to 128). |
| Software Dependencies | No | We fine-tune BERT-BASE and the compressed models on the five GLUE tasks using the Hugging Face transformers library. ... The models are trained with Adam optimizer (Kingma and Ba 2015). The paper names the software but does not specify version numbers, which limits reproducibility. |
| Experiment Setup | Yes | The training hyperparameters are tuned separately for each task. We select the value of learning rate from {1e-5, 2e-5, 3e-5, 5e-5} and the value of batch size from {32, 64}. The range of the number of epochs varies across different settings. Due to space limitations, please refer to the code link for detailed hyperparameter settings. (A hedged sketch of this tuning grid follows the table.) |
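
The compression recipe quoted in the Research Type row combines weight pruning, low-rank factorization and knowledge distillation. As a minimal, hedged illustration of the low-rank factorization step alone (not ROSITA's actual procedure), the sketch below replaces a single dense linear layer with two smaller ones via truncated SVD; the rank and the example layer size are illustrative assumptions, and the authors' real implementation lives at https://github.com/llyx97/Rosita.

```python
# Minimal sketch: low-rank factorization of one linear layer via truncated SVD.
# The rank (256) and the FFN-sized layer are illustrative assumptions, not the
# configuration used by ROSITA; see https://github.com/llyx97/Rosita for the
# authors' implementation, which also integrates pruning and KD.
import torch
import torch.nn as nn


def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate layer.weight with a rank-`rank` product of two smaller layers."""
    W = layer.weight.data                      # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]               # (out_features, rank)
    V_r = Vh[:rank, :]                         # (rank, in_features)

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data.copy_(V_r)
    second.weight.data.copy_(U_r)
    if layer.bias is not None:
        second.bias.data.copy_(layer.bias.data)
    return nn.Sequential(first, second)


if __name__ == "__main__":
    dense = nn.Linear(768, 3072)               # FFN-sized layer, as in BERT-base
    low_rank = factorize_linear(dense, rank=256)
    x = torch.randn(4, 768)
    # Reconstruction error of the factorized layer on a random input batch.
    print(torch.dist(dense(x), low_rank(x)))
```

Replacing the 768×3072 weight with a rank-256 factorization cuts that layer's parameters from roughly 2.4M to about 1.0M, which is the kind of size/accuracy trade-off the paper's exploratory study tunes jointly with pruning and KD.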
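
The Software Dependencies and Experiment Setup rows name the Hugging Face transformers library, the Adam optimizer and the search grid, but defer exact per-task values to the code link. Below is a hedged sketch of such a grid search for one GLUE task; the task (SST-2), checkpoint, epoch count and Trainer defaults (AdamW rather than plain Adam) are assumptions for illustration, not the paper's reported settings.

```python
# Hedged sketch of the hyperparameter grid quoted in the Experiment Setup row:
# learning rate in {1e-5, 2e-5, 3e-5, 5e-5}, batch size in {32, 64}.
# SST-2, bert-base-uncased and 3 epochs are placeholders; the Trainer's default
# AdamW optimizer stands in for the Adam optimizer cited in the paper.
import itertools

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("glue", "sst2")
encoded = dataset.map(lambda ex: tokenizer(ex["sentence"], truncation=True), batched=True)

for lr, bs in itertools.product([1e-5, 2e-5, 3e-5, 5e-5], [32, 64]):
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
    args = TrainingArguments(
        output_dir=f"runs/sst2_lr{lr}_bs{bs}",
        learning_rate=lr,
        per_device_train_batch_size=bs,
        num_train_epochs=3,                    # epoch range varies per task in the paper
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=encoded["train"],
        eval_dataset=encoded["validation"],    # the paper evaluates on the GLUE dev sets
        tokenizer=tokenizer,
    )
    trainer.train()
    print(f"lr={lr} bs={bs}", trainer.evaluate())
```

Each (learning rate, batch size) pair is scored on the development set, consistent with the Dataset Splits row's statement that the exploratory study reports dev-set performance.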