Towards Lossless Head Pruning through Automatic Peer Distillation for Language Models
Authors: Bingbing Li, Zigeng Wang, Shaoyi Huang, Mikhail Bragin, Ji Li, Caiwen Ding
IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on the General Language Understanding Evaluation (GLUE) benchmark are provided using the BERT model. By recycling discarded knowledge from pruned heads, the proposed method maintains model performance across all nine tasks while reducing heads by over 58% on average, and it outperforms state-of-the-art techniques (e.g., Random, HISP, L0 Norm, SMP). |
| Researcher Affiliation | Collaboration | 1University of Connecticut 2Microsoft Corporation |
| Pseudocode | Yes | Algorithm 1: Peer distillation head pruning procedure |
| Open Source Code | No | The paper does not provide a direct link to its source code or explicitly state that the code for its methodology is made publicly available. |
| Open Datasets | Yes | We test our method on GLUE benchmark [Wang et al., 2018] and report the performance of the unpruned and pruned models following the conventions by using accuracy for SST-2, QNLI, MNLI, QQP, RTE and WNLI; Matthews Correlation Coefficient (MCC) for CoLA, F1 scores for MRPC, and Spearman for STS-B. Our pre-trained model is the BERT-BASE [Devlin et al., 2018] model. |
| Dataset Splits | Yes | We test our method on GLUE benchmark [Wang et al., 2018] and report the performance of the unpruned and pruned models following the conventions by using accuracy for SST-2, QNLI, MNLI, QQP, RTE and WNLI; Matthews Correlation Coefficient (MCC) for CoLA, F1 scores for MRPC, and Spearman for STS-B. Our pre-trained model is the BERT-BASE [Devlin et al., 2018] model. We follow the default finetuning steps for 9 tasks according to Huggingface [Wolf et al., 2019] and obtain baseline models after training for 4 epochs. (A hedged fine-tuning sketch follows the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or specific machine configurations) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'Huggingface [Wolf et al., 2019]' but does not specify a version number for the library or any other software dependencies with their versions. |
| Experiment Setup | Yes | We follow the default finetuning steps for 9 tasks according to Huggingface [Wolf et al., 2019] and obtain baseline models after training for 4 epochs. We use the default initial learning rate (3e-5) to update model weights. For auxiliary parameter updates, a larger initial learning rate, lr_g, enhances the ability of the gate optimizer to adjust gate parameters and leads to higher sparsity. While selecting different knowledge loss penalty factors, λ, from 0.2 to 0.5, we observe similar compact model performance (values change within 5.62% in F1 score and 0.01 in final training mixed loss) in Fig. 7. In experiments, we choose λ = 0.35 for the final evaluation. (A hedged mixed-loss sketch follows the table.) |
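
As context for the "Dataset Splits" and "Experiment Setup" rows, below is a minimal sketch of the baseline GLUE fine-tuning the paper describes (BERT-base, 4 epochs, initial learning rate 3e-5) using the Hugging Face `transformers` and `datasets` libraries. The task name, batch size, and sequence length are illustrative assumptions; the paper only states that the default Huggingface fine-tuning steps were followed.

```python
# Hypothetical baseline fine-tuning sketch; not the authors' code.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

task = "mrpc"                       # any of the nine GLUE tasks would work here
raw = load_dataset("glue", task)    # standard GLUE train/validation/test splits
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode(batch):
    # MRPC is a sentence-pair task; single-sentence tasks pass one field only.
    return tok(batch["sentence1"], batch["sentence2"],
               truncation=True, max_length=128, padding="max_length")

encoded = raw.map(encode, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="glue-mrpc-baseline",
    num_train_epochs=4,              # baseline models trained for 4 epochs
    learning_rate=3e-5,              # default initial learning rate from the paper
    per_device_train_batch_size=32,  # assumed; not stated in the paper
)
Trainer(model=model, args=args,
        train_dataset=encoded["train"],
        eval_dataset=encoded["validation"]).train()
```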
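
The "Experiment Setup" row mentions a gate optimizer with a larger learning rate lr_g and a knowledge loss penalty factor λ = 0.35. The toy PyTorch sketch below only illustrates how such a mixed loss and a separate gate optimizer could be wired together; the gate shapes, the value of lr_g, and the stand-in knowledge loss are assumptions, and the actual peer-distillation loss is defined by the paper's Algorithm 1.

```python
# Toy sketch of a mixed task + knowledge loss with separate weight/gate
# optimizers. This is NOT the paper's implementation; all shapes, the value
# of lr_g, and the stand-in knowledge loss are illustrative assumptions.
import torch
import torch.nn as nn

num_layers, num_heads = 12, 12                           # BERT-base head layout
gates = nn.Parameter(torch.ones(num_layers, num_heads))  # learnable head gates
classifier = nn.Linear(768, 2)                           # toy task head

lam = 0.35                         # knowledge loss penalty factor λ (from the paper)
lr, lr_g = 3e-5, 1e-2              # lr_g is assumed; the paper only says "larger"

weight_opt = torch.optim.AdamW(classifier.parameters(), lr=lr)
gate_opt = torch.optim.AdamW([gates], lr=lr_g)           # gate optimizer

# One illustrative step on random pooled features (batch of 8).
features = torch.randn(8, 768)
labels = torch.randint(0, 2, (8,))

task_loss = nn.functional.cross_entropy(classifier(features), labels)
# Stand-in for the peer-distillation knowledge loss; the paper instead
# transfers knowledge from gated-off heads to the surviving heads.
know_loss = gates.abs().mean()
mixed_loss = task_loss + lam * know_loss                 # final training mixed loss

weight_opt.zero_grad(); gate_opt.zero_grad()
mixed_loss.backward()
weight_opt.step(); gate_opt.step()
```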