Less is More: Task-aware Layer-wise Distillation for Language Model Compression

Authors: Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen, Tuo Zhao

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate TED in two scenarios: continual pre-training and fine-tuning. TED demonstrates significant and consistent improvements over existing distillation methods in both scenarios.
Researcher Affiliation | Collaboration | 1: H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, U.S.A. 2: Microsoft, Redmond, U.S.A.
Pseudocode | No | The paper describes the two-stage training procedure but does not include a formal pseudocode block or algorithm; a hedged sketch of such a procedure is given below the table.
Open Source Code | Yes | Code is available at https://github.com/cliang1453/task-aware-distillation.
Open Datasets | Yes | We use OpenWebText (Gokaslan et al., 2019)... LAMBADA (Paperno et al., 2016) and WikiText-103 (Merity et al., 2017). GLUE (Wang et al., 2019) benchmark... SQuAD v1.1/2.0 (Rajpurkar et al., 2016a; 2018)
Dataset Splits | Yes | Table 14. Summary of the GLUE benchmark (columns: Corpus, Task, #Train, #Dev, #Test, #Label, Metrics; includes a #Dev column). Table 5. Evaluation results on the GLUE dev set.
Hardware Specification | Yes | We use mixed precision training and train on 8 80GB Nvidia A100 GPUs. We use mixed precision training and train on 8 32GB Nvidia V100 GPUs.
Software Dependencies | Yes | Our implementation is based on Huggingface Transformers.
Experiment Setup | Yes | Table 10. Hyper-parameters for training GPT-2_6 on OpenWebText. Table 12. Hyper-parameters for fine-tuning BERT-base_12 on MNLI. Table 13. Hyper-parameters for fine-tuning DeBERTaV3 models on the downstream tasks.
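Since the paper reports the two-stage training procedure only in prose, the following PyTorch sketch illustrates one way the pipeline could be wired up, assuming Huggingface-style models that expose output_hidden_states. The TaskAwareFilter module, the every-other-layer mapping, and the weight alpha are illustrative assumptions, not the authors' implementation (which is available in the linked repository).

```python
# Hedged sketch of the two-stage TED procedure: stage 1 trains task-aware
# filters on frozen backbones; stage 2 distills the student through the
# filtered layer-wise representations. Names and the layer mapping are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TaskAwareFilter(nn.Module):
    """One filter per layer: a projection plus a task head (assumed form)."""

    def __init__(self, hidden_dim: int, num_labels: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_labels)

    def forward(self, hidden_states: torch.Tensor):
        filtered = self.proj(hidden_states)        # task-aware representation
        logits = self.head(filtered.mean(dim=1))   # pooled prediction for the task loss
        return filtered, logits


def stage1_train_filters(model, filters, batch, labels, task_loss_fn):
    """Stage 1: freeze the backbone, train only the layer-wise filters on the task.

    `batch` holds tokenized inputs (input_ids, attention_mask) without labels.
    """
    with torch.no_grad():
        hidden_states = model(**batch, output_hidden_states=True).hidden_states
    loss = 0.0
    for h, f in zip(hidden_states[1:], filters):   # one filter per transformer layer
        _, logits = f(h)
        loss = loss + task_loss_fn(logits, labels)
    return loss


def stage2_distill(teacher, student, t_filters, s_filters, batch, labels, alpha=1.0):
    """Stage 2: train the student; align filtered teacher/student representations."""
    with torch.no_grad():
        t_hidden = teacher(**batch, output_hidden_states=True).hidden_states
    s_out = student(**batch, output_hidden_states=True, labels=labels)
    distill_loss = 0.0
    # Assumes a fixed mapping of every other teacher layer to a half-depth student.
    for s_h, t_h, s_f, t_f in zip(s_out.hidden_states[1:], t_hidden[2::2],
                                  s_filters, t_filters):
        s_filtered, _ = s_f(s_h)
        with torch.no_grad():                      # teacher-side filter stays frozen
            t_filtered, _ = t_f(t_h)
        distill_loss = distill_loss + F.mse_loss(s_filtered, t_filtered)
    return s_out.loss + alpha * distill_loss       # task loss + weighted distillation loss
```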
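For the dataset and software-dependency rows, a minimal sketch of pulling the cited assets with Huggingface Transformers and the datasets library follows; the checkpoint names and the 6-layer student configuration are assumptions for illustration, not values taken from the paper's configs.

```python
# Hedged sketch: load GLUE/MNLI and build a BERT-base teacher plus a smaller
# student with Huggingface tooling. Checkpoint names and student depth are assumed.
from datasets import load_dataset
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

mnli = load_dataset("glue", "mnli")                       # open dataset cited in the paper
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

teacher = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)

# Shallower student from a truncated BERT-base config (assumed setup).
student_config = AutoConfig.from_pretrained(
    "bert-base-uncased", num_labels=3, num_hidden_layers=6)
student = AutoModelForSequenceClassification.from_config(student_config)

encoded = mnli["train"].map(
    lambda ex: tokenizer(ex["premise"], ex["hypothesis"],
                         truncation=True, max_length=128),
    batched=True)
```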
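The hardware row mentions mixed-precision training on A100/V100 GPUs. Below is a minimal sketch using PyTorch's native AMP; the toy model, optimizer, and synthetic data are placeholders rather than details from the paper.

```python
# Hedged sketch of a mixed-precision training loop with torch.cuda.amp.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(768, 3).to(device)                   # stand-in for the student
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for step in range(10):                                 # toy loop with synthetic data
    x = torch.randn(32, 768, device=device)
    y = torch.randint(0, 3, (32,), device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()                      # scaled backward avoids fp16 underflow
    scaler.step(optimizer)                             # unscales grads, then steps
    scaler.update()
```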