Less is More: Task-aware Layer-wise Distillation for Language Model Compression
Authors: Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen, Tuo Zhao
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate TED in two scenarios: continual pre-training and fine-tuning. TED demonstrates significant and consistent improvements over existing distillation methods in both scenarios. |
| Researcher Affiliation | Collaboration | 1H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, U.S.A. 2Microsoft, Redmond, U.S.A. |
| Pseudocode | No | The paper describes the two-stage training procedure but does not include a formal pseudocode block or algorithm (an illustrative sketch of the layer-wise distillation step follows the table). |
| Open Source Code | Yes | Code is available at https://github.com/cliang1453/task-aware-distillation. |
| Open Datasets | Yes | We use OpenWebText (Gokaslan et al., 2019)... LAMBADA (Paperno et al., 2016) and WikiText-103 (Merity et al., 2017). GLUE (Wang et al., 2019) benchmark... SQuAD v1.1/2.0 (Rajpurkar et al., 2016a; 2018) |
| Dataset Splits | Yes | Table 14 (Summary of the GLUE benchmark) reports #Train, #Dev, and #Test counts per corpus; Table 5 reports evaluation results on the GLUE dev set. |
| Hardware Specification | Yes | We use mixed precision training and train on 8 80G Nvidia A100 GPUs. We use mixed precision training and train on 8 32G Nvidia V100 GPUs. |
| Software Dependencies | Yes | Our implementation is based on Huggingface Transformers. |
| Experiment Setup | Yes | Table 10. Hyper-parameters for training GPT-2 on OpenWebText. Table 12. Hyper-parameters for fine-tuning BERT-base on MNLI. Table 13. Hyper-parameters for fine-tuning DeBERTaV3 models on the downstream tasks. |
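
Since the paper provides no formal pseudocode, the following is a minimal sketch of one task-aware layer-wise distillation step in PyTorch. The filter modules, dimensions, layer matching, and variable names below are illustrative assumptions, not the authors' implementation; the official code is in the repository linked above.

```python
# Minimal, hypothetical sketch of a task-aware layer-wise distillation step.
# All names, sizes, and the uniform layer matching are placeholder assumptions.
import torch
import torch.nn as nn

hidden_teacher, hidden_student, num_layers = 768, 384, 6

# One task-aware filter per distilled layer, projecting hidden states into a
# shared space where task-relevant knowledge is compared.
teacher_filters = nn.ModuleList([nn.Linear(hidden_teacher, 256) for _ in range(num_layers)])
student_filters = nn.ModuleList([nn.Linear(hidden_student, 256) for _ in range(num_layers)])

mse = nn.MSELoss()

def layerwise_distillation_loss(teacher_hiddens, student_hiddens):
    """Average MSE between filtered teacher and student hidden states.

    teacher_hiddens / student_hiddens: lists of [batch, seq_len, hidden]
    tensors, one per selected layer (teacher layers assumed pre-matched to
    student layers).
    """
    loss = 0.0
    for f_t, f_s, h_t, h_s in zip(teacher_filters, student_filters,
                                  teacher_hiddens, student_hiddens):
        # Teacher side is detached: it is not updated while training the student.
        loss = loss + mse(f_s(h_s), f_t(h_t).detach())
    return loss / num_layers

# Toy usage with random tensors standing in for model hidden states.
t_h = [torch.randn(2, 16, hidden_teacher) for _ in range(num_layers)]
s_h = [torch.randn(2, 16, hidden_student) for _ in range(num_layers)]
print(layerwise_distillation_loss(t_h, s_h))
```

The sketch covers only the filtered-alignment term; in the paper's two-stage procedure this term is combined with the task loss when training the student, after the filters have first been trained under task supervision.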