Towards Lossless Head Pruning through Automatic Peer Distillation for Language Models
Authors: Bingbing Li, Zigeng Wang, Shaoyi Huang, Mikhail Bragin, Ji Li, Caiwen Ding
IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on the General Language Understanding Evaluation (GLUE) benchmark are provided using the BERT model. By recycling discarded knowledge from pruned heads, the proposed method maintains model performance across all nine tasks while reducing heads by over 58% on average, and it outperforms state-of-the-art techniques (e.g., Random, HISP, L0 Norm, SMP). |
| Researcher Affiliation | Collaboration | 1University of Connecticut 2Microsoft Corporation |
| Pseudocode | Yes | Algorithm 1: Peer distillation head pruning procedure |
| Open Source Code | No | The paper does not provide a direct link to its source code or explicitly state that the code for its methodology is made publicly available. |
| Open Datasets | Yes | We test our method on GLUE benchmark [Wang et al., 2018] and report the performance of the unpruned and pruned models following the conventions by using accuracy for SST-2, QNLI, MNLI, QQP, RTE and WNLI; Matthews Correlation Coefficient (MCC) for CoLA, F1 scores for MRPC, and Spearman for STS-B. Our pre-trained model is the BERT-BASE [Devlin et al., 2018] model. |
| Dataset Splits | Yes | We test our method on GLUE benchmark [Wang et al., 2018] and report the performance of the unpruned and pruned models following the conventions by using accuracy for SST-2, QNLI, MNLI, QQP, RTE and WNLI; Matthews Correlation Coefficient (MCC) for CoLA, F1 scores for MRPC, and Spearman for STS-B. Our pre-trained model is the BERT-BASE [Devlin et al., 2018] model. We follow the default finetuning steps for 9 tasks according to Huggingface [Wolf et al., 2019] and obtain baseline models after training for 4 epochs. (A hedged fine-tuning sketch follows the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or specific machine configurations) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'Huggingface [Wolf et al., 2019]' but does not specify a version number for the library or any other software dependencies with their versions. |
| Experiment Setup | Yes | We follow the default finetuning steps for 9 tasks according to Huggingface [Wolf et al., 2019] and obtain baseline models after training for 4 epochs. We use the default initial learning rate (3e-5) to update model weights. For auxiliary parameter updates, a larger initial learning rate, lr_g, enhances the ability of the gate optimizer to adjust gate parameters and leads to higher sparsity. While selecting different knowledge loss penalty factors, λ, from 0.2 to 0.5, we observe similar compact model performance (values change within 5.62% in F1 score and 0.01 in final training mixed loss) in Fig. 7. In experiments, we choose λ = 0.35 for the final evaluation. (A hedged mixed-loss sketch follows the table.) |
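
As context for the "Dataset Splits" and "Experiment Setup" rows, below is a minimal sketch of the baseline GLUE fine-tuning the paper describes (BERT-base, 4 epochs, initial learning rate 3e-5) using the Hugging Face `transformers` and `datasets` libraries. The task name, batch size, and sequence length are illustrative assumptions; the paper only states that the default Huggingface fine-tuning steps were followed.

```python
# Hypothetical baseline fine-tuning sketch; not the authors' code.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

task = "mrpc"                       # any of the nine GLUE tasks would work here
raw = load_dataset("glue", task)    # standard GLUE train/validation/test splits
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode(batch):
    # MRPC is a sentence-pair task; single-sentence tasks pass one field only.
    return tok(batch["sentence1"], batch["sentence2"],
               truncation=True, max_length=128, padding="max_length")

encoded = raw.map(encode, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="glue-mrpc-baseline",
    num_train_epochs=4,              # baseline models trained for 4 epochs
    learning_rate=3e-5,              # default initial learning rate from the paper
    per_device_train_batch_size=32,  # assumed; not stated in the paper
)
Trainer(model=model, args=args,
        train_dataset=encoded["train"],
        eval_dataset=encoded["validation"]).train()
```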
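
The "Experiment Setup" row mentions a gate optimizer with a larger learning rate lr_g and a knowledge loss penalty factor λ = 0.35. The toy PyTorch sketch below only illustrates how such a mixed loss and a separate gate optimizer could be wired together; the gate shapes, the value of lr_g, and the stand-in knowledge loss are assumptions, and the actual peer-distillation loss is defined by the paper's Algorithm 1.

```python
# Toy sketch of a mixed task + knowledge loss with separate weight/gate
# optimizers. This is NOT the paper's implementation; all shapes, the value
# of lr_g, and the stand-in knowledge loss are illustrative assumptions.
import torch
import torch.nn as nn

num_layers, num_heads = 12, 12                           # BERT-base head layout
gates = nn.Parameter(torch.ones(num_layers, num_heads))  # learnable head gates
classifier = nn.Linear(768, 2)                           # toy task head

lam = 0.35                         # knowledge loss penalty factor λ (from the paper)
lr, lr_g = 3e-5, 1e-2              # lr_g is assumed; the paper only says "larger"

weight_opt = torch.optim.AdamW(classifier.parameters(), lr=lr)
gate_opt = torch.optim.AdamW([gates], lr=lr_g)           # gate optimizer

# One illustrative step on random pooled features (batch of 8).
features = torch.randn(8, 768)
labels = torch.randint(0, 2, (8,))

task_loss = nn.functional.cross_entropy(classifier(features), labels)
# Stand-in for the peer-distillation knowledge loss; the paper instead
# transfers knowledge from gated-off heads to the surviving heads.
know_loss = gates.abs().mean()
mixed_loss = task_loss + lam * know_loss                 # final training mixed loss

weight_opt.zero_grad(); gate_opt.zero_grad()
mixed_loss.backward()
weight_opt.step(); gate_opt.step()
```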