Exploring extreme parameter compression for pre-trained language models

Authors: Benyou Wang, Yuxin Ren, Lifeng Shang, Xin Jiang, Qun Liu

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we aim to explore larger compression ratios for PLMs, among which tensor decomposition is a potential but under-investigated one. Two decomposition and reconstruction protocols are further proposed to improve the effectiveness and efficiency during compression. Our compressed BERT with 1/7 parameters in Transformer layers performs on-par with, sometimes slightly better than, the original BERT on the GLUE benchmark. A tiny version achieves 96.7% of BERT-base performance with 1/48 encoder parameters (i.e., less than 2M parameters excluding the embedding layer) and is 2.7× faster at inference. (Abstract; Section 6, Experiments)
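As a sanity check on the "less than 2M parameters excluding the embedding layer" figure, the sketch below (an illustration, not part of the paper) counts BERT-base parameters with the Huggingface transformers API and splits off the embedding layer; note that it lumps the pooler in with the encoder.

```python
# Illustrative only: count BERT-base parameters and separate the embedding layer
# from the rest, so the "1/48 of encoder parameters" ratio can be checked roughly.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

total = sum(p.numel() for p in model.parameters())
embedding = sum(p.numel() for name, p in model.named_parameters()
                if name.startswith("embeddings"))
non_embedding = total - embedding  # encoder layers plus the pooler

print(f"total:         {total / 1e6:.1f}M")
print(f"embedding:     {embedding / 1e6:.1f}M")
print(f"non-embedding: {non_embedding / 1e6:.1f}M "
      f"(1/48 of this is {non_embedding / 48 / 1e6:.2f}M)")
```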
Researcher Affiliation | Collaboration | Benyou Wang (University of Padua, wang@dei.unipd.it); Yuxin Ren (Tsinghua University, ryx20@mails.tsinghua.edu.cn); Lifeng Shang, Xin Jiang, Qun Liu (Huawei Noah's Ark Lab, {Shang.Lifeng, Jiang.Xin, qun.liu}@huawei.com)
Pseudocode | No | The paper describes its decomposition and reconstruction protocols with mathematical formulas and text, but it does not include any structured pseudocode or algorithm blocks.
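For intuition only, here is a minimal low-rank sketch of the decompose-then-reconstruct idea: it factors a single weight matrix with a truncated SVD and rebuilds it on demand. This is not the paper's protocol, which decomposes the stacked weight tensor of all Transformer layers and shares factors across layers; the matrix size and rank below are arbitrary.

```python
# Toy decomposition/reconstruction with truncated SVD (not the paper's protocol).
import torch

def decompose(weight: torch.Tensor, rank: int):
    """Return factors (A, B) such that A @ B approximates `weight`."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out_dim, rank), singular values folded into A
    B = Vh[:rank, :]             # (rank, in_dim)
    return A, B

def reconstruct(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Rebuild the approximate dense weight before/at the forward pass."""
    return A @ B

w = torch.randn(768, 768)        # e.g., one attention projection in a BERT-base layer
A, B = decompose(w, rank=64)
rel_err = torch.norm(w - reconstruct(A, B)) / torch.norm(w)
print(f"parameters: {w.numel()} -> {A.numel() + B.numel()}, relative error {rel_err:.3f}")
```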
Open Source Code | Yes | https://github.com/twinkle0331/Xcompression (footnote 1 in the paper)
Open Datasets | Yes | GLUE evaluation: GLUE (Wang et al., 2018) (see App. I for more details) includes datasets for single-document classification and sentence-pair classification.
Dataset Splits | Yes | GLUE evaluation: GLUE (Wang et al., 2018) (see App. I for more details) includes datasets for single-document classification and sentence-pair classification. Fine-tuning and evaluation on GLUE follow the settings from Huggingface (Wolf et al., 2019). The best-performing model is selected on the dev set, with the learning rate chosen from [1e-5, 2e-5] and the batch size from [16, 32]. Table 8 gives task descriptions and statistics for GLUE (Wang et al., 2018), where NLI stands for Natural Language Inference and QA for Question Answering; SST-2, MNLI, QNLI, and QQP are considered relatively big datasets according to the scale of their train sets. The per-task train sizes reported are 8.5k, 67k, 393k, 3.7k, 105k, 364k, 2.5k, and 7k, with test sizes of 1k, 1.8k, 20k, 1.7k, 5.4k, 391k, 3k, and 1.4k.
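The dev-set model selection described in this row amounts to a small grid search. The sketch below is a rough outline using the Huggingface datasets/transformers APIs; the checkpoint path "path/to/compressed-bert", the choice of MRPC as the example task, the epoch count, and selection by eval loss are placeholders and assumptions rather than the paper's exact setup.

```python
# Hedged sketch: grid search over the reported ranges (lr in {1e-5, 2e-5},
# batch size in {16, 32}) with selection on the GLUE dev (validation) split.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

task = "mrpc"                                   # example GLUE sentence-pair task
raw = load_dataset("glue", task)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = raw.map(lambda ex: tok(ex["sentence1"], ex["sentence2"], truncation=True),
              batched=True)

results = {}
for lr in (1e-5, 2e-5):
    for bs in (16, 32):
        model = AutoModelForSequenceClassification.from_pretrained(
            "path/to/compressed-bert", num_labels=2)  # placeholder checkpoint
        args = TrainingArguments(output_dir=f"out/{task}-lr{lr}-bs{bs}",
                                 learning_rate=lr,
                                 per_device_train_batch_size=bs,
                                 num_train_epochs=3)
        trainer = Trainer(model=model, args=args,
                          train_dataset=enc["train"],
                          eval_dataset=enc["validation"],
                          tokenizer=tok)              # enables dynamic padding
        trainer.train()
        # The paper selects by the task metric; eval loss keeps this sketch short.
        results[(lr, bs)] = trainer.evaluate()["eval_loss"]

print("best (learning rate, batch size):", min(results, key=results.get))
```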
Hardware Specification | Yes | Requests Per Second (RPS) is the throughput measured on a single Nvidia V100 GPU (16 GB) using the full GPU memory; see App. J for actual inference time.
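Throughput in requests per second can be estimated with a simple timing loop like the one below. This is an assumption-laden sketch: the batch size, sequence length, number of batches, and the BERT-base checkpoint are illustrative rather than the paper's configuration.

```python
# Rough RPS (throughput) estimate on a single GPU: time batched forward passes
# and divide the number of processed sequences by the elapsed wall-clock time.
import time
import torch
from transformers import BertModel

device = "cuda"
model = BertModel.from_pretrained("bert-base-uncased").to(device).eval()

batch_size, seq_len, n_batches = 128, 128, 20          # illustrative settings
input_ids = torch.randint(0, model.config.vocab_size,
                          (batch_size, seq_len), device=device)

with torch.no_grad():
    model(input_ids)                                    # warm-up pass
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_batches):
        model(input_ids)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"RPS ~= {batch_size * n_batches / elapsed:.1f} sequences/second")
```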
Software Dependencies | No | Fine-tuning and evaluation on GLUE follow the settings from Huggingface (Wolf et al., 2019). While Huggingface is mentioned, no specific version number for it or for other software dependencies is provided.
Experiment Setup | Yes | Knowledge distillation: As in (Jiao et al., 2020; Zhang et al., 2020; Bai et al., 2020), we use two-stage knowledge distillation for the compressed model. At the General Distillation (GD) stage, we adopt Knowledge Distillation (KD) for the compressed model to simulate the last-layer hidden states and last-layer attention maps of the general teacher model (BERT-base). At the second stage, we adopt Task-specific Distillation (TD) to simulate the logits of a task-specific BERT model (e.g., fine-tuned on the MNLI task). In GD, compressed models are trained for two epochs. In TD, we also augment the training data by randomly replacing a random word with a similar word, chosen either by word-vector similarity using GloVe (Pennington et al., 2014) or by the predicted logits of BERT when masking the target word; see more details in (Jiao et al., 2020). GLUE evaluation: GLUE (Wang et al., 2018) (see App. I for more details) includes datasets for single-document classification and sentence-pair classification. Fine-tuning and evaluation on GLUE follow the settings from Huggingface (Wolf et al., 2019). The best-performing model is selected on the dev set, with the learning rate chosen from [1e-5, 2e-5] and the batch size from [16, 32].
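To make the two distillation stages concrete, here is a minimal sketch of the two objectives as described above: MSE on last-layer hidden states and attention maps for General Distillation, and soft cross-entropy on a fine-tuned teacher's logits for Task-specific Distillation. The tensor shapes, the equal loss weighting, and the temperature are assumptions, and the student is assumed to share the teacher's hidden size (as a weight-decomposed BERT does).

```python
# Hedged sketch of the GD and TD objectives (weighting/temperature are assumptions).
import torch
import torch.nn.functional as F

def general_distillation_loss(s_hidden, t_hidden, s_attn, t_attn):
    """GD: match the teacher's last-layer hidden states and attention maps (MSE)."""
    return F.mse_loss(s_hidden, t_hidden) + F.mse_loss(s_attn, t_attn)

def task_distillation_loss(s_logits, t_logits, temperature=1.0):
    """TD: match a task-specific teacher's logits via temperature-scaled KL."""
    t = temperature
    return F.kl_div(F.log_softmax(s_logits / t, dim=-1),
                    F.softmax(t_logits / t, dim=-1),
                    reduction="batchmean") * (t * t)

# Example shapes for a BERT-base-sized teacher: hidden (batch, seq, 768),
# attention (batch, heads, seq, seq), logits (batch, num_labels).
s_h, t_h = torch.randn(8, 128, 768), torch.randn(8, 128, 768)
s_a, t_a = torch.rand(8, 12, 128, 128), torch.rand(8, 12, 128, 128)
s_l, t_l = torch.randn(8, 3), torch.randn(8, 3)
print(general_distillation_loss(s_h, t_h, s_a, t_a).item(),
      task_distillation_loss(s_l, t_l, temperature=2.0).item())
```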