Less is More: Task-aware Layer-wise Distillation for Language Model Compression

Authors: Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen, Tuo Zhao

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate TED in two scenarios: continual pre-training and fine-tuning. TED demonstrates significant and consistent improvements over existing distillation methods in both scenarios.
Researcher Affiliation | Collaboration | 1: H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, U.S.A. 2: Microsoft, Redmond, U.S.A.
Pseudocode | No | The paper describes the two-stage training procedure but does not include a formal pseudocode block or algorithm; a hedged sketch of such a procedure is given below the table.
Open Source Code | Yes | Code is available at https://github.com/cliang1453/task-aware-distillation.
Open Datasets | Yes | We use OpenWebText (Gokaslan et al., 2019)... LAMBADA (Paperno et al., 2016) and WikiText-103 (Merity et al., 2017). GLUE (Wang et al., 2019) benchmark... SQuAD v1.1/2.0 (Rajpurkar et al., 2016a; 2018)
Dataset Splits | Yes | Table 14. Summary of the GLUE benchmark (columns: Corpus, Task, #Train, #Dev, #Test, #Label, Metrics; includes a #Dev column). Table 5. Evaluation results on the GLUE dev set.
Hardware Specification | Yes | We use mixed precision training and train on 8 80GB Nvidia A100 GPUs. We use mixed precision training and train on 8 32GB Nvidia V100 GPUs.
Software Dependencies | Yes | Our implementation is based on Huggingface Transformers.
Experiment Setup | Yes | Table 10. Hyper-parameters for training GPT-2_6 on OpenWebText. Table 12. Hyper-parameters for fine-tuning BERT-base_12 on MNLI. Table 13. Hyper-parameters for fine-tuning DeBERTaV3 models on the downstream tasks.
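Since the paper reports the two-stage training procedure only in prose, the following PyTorch sketch illustrates one way the pipeline could be wired up, assuming Huggingface-style models that expose output_hidden_states. The TaskAwareFilter module, the every-other-layer mapping, and the weight alpha are illustrative assumptions, not the authors' implementation (which is available in the linked repository).

```python
# Hedged sketch of the two-stage TED procedure: stage 1 trains task-aware
# filters on frozen backbones; stage 2 distills the student through the
# filtered layer-wise representations. Names and the layer mapping are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TaskAwareFilter(nn.Module):
    """One filter per layer: a projection plus a task head (assumed form)."""

    def __init__(self, hidden_dim: int, num_labels: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_labels)

    def forward(self, hidden_states: torch.Tensor):
        filtered = self.proj(hidden_states)        # task-aware representation
        logits = self.head(filtered.mean(dim=1))   # pooled prediction for the task loss
        return filtered, logits


def stage1_train_filters(model, filters, batch, labels, task_loss_fn):
    """Stage 1: freeze the backbone, train only the layer-wise filters on the task.

    `batch` holds tokenized inputs (input_ids, attention_mask) without labels.
    """
    with torch.no_grad():
        hidden_states = model(**batch, output_hidden_states=True).hidden_states
    loss = 0.0
    for h, f in zip(hidden_states[1:], filters):   # one filter per transformer layer
        _, logits = f(h)
        loss = loss + task_loss_fn(logits, labels)
    return loss


def stage2_distill(teacher, student, t_filters, s_filters, batch, labels, alpha=1.0):
    """Stage 2: train the student; align filtered teacher/student representations."""
    with torch.no_grad():
        t_hidden = teacher(**batch, output_hidden_states=True).hidden_states
    s_out = student(**batch, output_hidden_states=True, labels=labels)
    distill_loss = 0.0
    # Assumes a fixed mapping of every other teacher layer to a half-depth student.
    for s_h, t_h, s_f, t_f in zip(s_out.hidden_states[1:], t_hidden[2::2],
                                  s_filters, t_filters):
        s_filtered, _ = s_f(s_h)
        with torch.no_grad():                      # teacher-side filter stays frozen
            t_filtered, _ = t_f(t_h)
        distill_loss = distill_loss + F.mse_loss(s_filtered, t_filtered)
    return s_out.loss + alpha * distill_loss       # task loss + weighted distillation loss
```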
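For the dataset and software-dependency rows, a minimal sketch of pulling the cited assets with Huggingface Transformers and the datasets library follows; the checkpoint names and the 6-layer student configuration are assumptions for illustration, not values taken from the paper's configs.

```python
# Hedged sketch: load GLUE/MNLI and build a BERT-base teacher plus a smaller
# student with Huggingface tooling. Checkpoint names and student depth are assumed.
from datasets import load_dataset
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

mnli = load_dataset("glue", "mnli")                       # open dataset cited in the paper
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

teacher = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)

# Shallower student from a truncated BERT-base config (assumed setup).
student_config = AutoConfig.from_pretrained(
    "bert-base-uncased", num_labels=3, num_hidden_layers=6)
student = AutoModelForSequenceClassification.from_config(student_config)

encoded = mnli["train"].map(
    lambda ex: tokenizer(ex["premise"], ex["hypothesis"],
                         truncation=True, max_length=128),
    batched=True)
```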
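The hardware row mentions mixed-precision training on A100/V100 GPUs. Below is a minimal sketch using PyTorch's native AMP; the toy model, optimizer, and synthetic data are placeholders rather than details from the paper.

```python
# Hedged sketch of a mixed-precision training loop with torch.cuda.amp.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(768, 3).to(device)                   # stand-in for the student
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for step in range(10):                                 # toy loop with synthetic data
    x = torch.randn(32, 768, device=device)
    y = torch.randint(0, 3, (32,), device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()                      # scaled backward avoids fp16 underflow
    scaler.step(optimizer)                             # unscales grads, then steps
    scaler.update()
```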