Language model compression with weighted low-rank factorization
Authors: Yen-Chang Hsu, Ting Hua, Sung-En Chang, Qian Lou, Yilin Shen, Hongxia Jin
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform analysis with the transformer-based language models, showing our weighted SVD largely alleviates the mismatched optimization objectives and can maintain model performance with a higher compression rate. Our method can directly compress a task-specific model while achieving better performance than other compact model strategies requiring expensive model pre-training. Moreover, the evaluation of compressing an already compact model shows our method can further reduce 9% to 30% parameters with an insignificant impact on task accuracy. Table 1: Results of CoNLL and GLUE benchmark. |
| Researcher Affiliation | Collaboration | Yen-Chang Hsu¹, Ting Hua¹, Sung-En Chang², Qian Lou¹, Yilin Shen¹, and Hongxia Jin¹; ¹Samsung Research America, ²Northeastern University |
| Pseudocode | No | The paper describes the mathematical formulation of SVD and FWSVD but does not include any pseudocode or algorithm blocks (a hedged sketch of the weighted factorization follows this table). |
| Open Source Code | No | The paper mentions using the 'popular Hugging Face Transformers library (Wolf et al., 2020)' but does not provide a statement or link for the open-source code of their own methodology. |
| Open Datasets | Yes | We evaluate the methods of all three paths in Figure 4 on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019) and a token classification task. We include 2 single sentence tasks: CoLA (Warstadt et al., 2018) measured in Matthews correlation, SST-2 (Socher et al., 2013) measured in classification accuracy; 3 sentence similarity tasks: MRPC (Dolan et al., 2005) measured in F-1 score, STS-B (Cer et al., 2017) measured in Pearson-Spearman correlation, QQP (Chen et al., 2018b) measured in F-1 score; and 3 natural language inference tasks: MNLI (Williams et al., 2018) measured in classification accuracy with the average of the matched and mismatched subsets, QNLI (Rajpurkar et al., 2016) measured in accuracy. The token classification task we used is the named entity recognition (NER) on the CoNLL-2003 dataset (Sang & De Meulder, 2003). (A data-loading sketch for these public datasets follows this table.) |
| Dataset Splits | Yes | For the SOTA models on path-1 (MiniLMv2 and DistilBERT), we use the pre-trained generic compact models (Sg) provided by the original authors as the starting point, then directly fine-tune them with 3 epochs on the target task training data. The fine-tuning is optimized by Adam with learning rate of 2×10⁻⁵ and batch size of 32 on one GPU. We directly report the results on the dev set of all the datasets, making the numbers convenient to compare and verify. |
| Hardware Specification | No | The paper mentions '384 NVIDIA V100 GPU hours' in the context of other works' pre-training costs, and 'one GPU' for fine-tuning in their own setup. However, it does not specify the exact GPU model or any other hardware specifications used for their experiments. |
| Software Dependencies | No | The paper states: 'Lastly, our implementation and experiments are built on top of the popular Hugging Face Transformers library (Wolf et al., 2020).' However, it does not provide specific version numbers for the library or any other software dependencies such as Python or PyTorch. |
| Experiment Setup | Yes | The fine-tuning is optimized by Adam with learning rate of 2×10⁻⁵ and batch size of 32 on one GPU. |
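
As noted in the Pseudocode row, the paper gives only the mathematical formulation of SVD and FWSVD. The snippet below is a minimal sketch of the Fisher-weighted factorization idea as we read it: scale each row of a weight matrix by the square root of its estimated importance (the diagonal Fisher information, approximated by accumulated squared gradients), truncate a standard SVD of the scaled matrix, and undo the scaling on the left factor. The function and variable names (`fisher_weighted_svd`, `row_fisher`) and the row-wise grouping are our assumptions, not code from the paper.

```python
import torch

def fisher_weighted_svd(weight: torch.Tensor, row_fisher: torch.Tensor, rank: int):
    """Sketch: low-rank factors A, B with A @ B ~ weight, weighting rows by importance.

    weight:      (out_features, in_features) matrix of a linear layer.
    row_fisher:  per-row importance, e.g. squared gradients of the task loss
                 accumulated over training data and summed along each row.
    """
    # Scale each row by sqrt(importance); the best rank-k approximation of the
    # scaled matrix minimizes the row-weighted Frobenius error of the original.
    scale = row_fisher.clamp_min(1e-8).sqrt().unsqueeze(1)   # (out, 1)
    u, s, vh = torch.linalg.svd(scale * weight, full_matrices=False)
    u, s, vh = u[:, :rank], s[:rank], vh[:rank, :]
    # Undo the row scaling on the left factor so that a @ b approximates weight.
    a = (u * s) / scale                                       # (out, rank)
    b = vh                                                    # (rank, in)
    return a, b

# Toy check: factorize a 768x768 matrix to rank 64 with a random importance vector.
w = torch.randn(768, 768)
fisher = torch.rand(768)  # placeholder for accumulated squared gradients
a, b = fisher_weighted_svd(w, fisher, rank=64)
print((a @ b).shape)      # torch.Size([768, 768])
```

With `row_fisher` set to a constant vector this reduces to plain truncated SVD, which is the baseline the paper compares against.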
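All of the evaluation data listed in the Open Datasets row is publicly available; a sketch of how it can be pulled with the Hugging Face `datasets` library is shown below. The paper does not describe its data-loading code, so the dataset identifiers here (the `glue` configs and `conll2003`) are the standard Hub names, not ones confirmed by the authors.

```python
from datasets import load_dataset

# GLUE tasks used in the evaluation (the paper reports dev-set results).
glue_tasks = ["cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli"]
glue = {task: load_dataset("glue", task) for task in glue_tasks}

# Token classification: CoNLL-2003 named entity recognition.
conll = load_dataset("conll2003")

print(glue["sst2"]["validation"][0])   # dev-set example for SST-2
print(conll["validation"][0])          # dev-set example for CoNLL-2003 NER
```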
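The fine-tuning recipe in the Experiment Setup row (Adam, learning rate 2×10⁻⁵, batch size 32, 3 epochs, one GPU, results reported on the dev sets) maps onto a Hugging Face `Trainer` configuration roughly as follows; the authors did not release code, so this is an illustrative mapping rather than their actual script.

```python
from transformers import TrainingArguments

# Hyperparameters as reported: Adam-style optimizer, lr 2e-5, batch size 32,
# 3 epochs, evaluation on the dev split. The output directory is a placeholder.
training_args = TrainingArguments(
    output_dir="fwsvd-finetune",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="epoch",   # report dev-set metrics after each epoch
)
```

A `Trainer` built with these arguments, a (factorized) task model, and one of the datasets above would reproduce the reported recipe at a high level.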