DDK: Distilling Domain Knowledge for Efficient Large Language Models

Authors: Jiaheng Liu, Chenchen Zhang, Jinyang Guo, Yuanxing Zhang, Haoran Que, Ken Deng, Zhiqi Bai, Jie Liu, Ge Zhang, Jiakai Wang, Yanan Wu, Congnan Liu, Jiamang Wang, Lin Qu, Wenbo Su, Bo Zheng

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive evaluations show that DDK significantly improves the performance of student models, outperforming both continuously pretrained baselines and existing knowledge distillation methods by a large margin."
Researcher Affiliation | Collaboration | Taobao & Tmall Group of Alibaba; Alibaba Group; The University of Sydney; The Chinese University of Hong Kong; University of Waterloo
Pseudocode | Yes | "Algorithm 1: Distillation procedure of the DDK framework." (A hedged code sketch of this loop follows the table.)
Open Source Code | No | "All data used in this paper is open-sourced. The codes for the baseline methods are also collected from GitHub."
Open Datasets | Yes | "Due to the unavailability of training data for LLaMA2 and Qwen-1.5 models, we mainly utilize RedPajama [16] for distillation... Moreover, to enhance the model's proficiency in Chinese and Mathematics, we also incorporate three cleaned open-source datasets (i.e., Chinese Books [19], Chinese Common Crawl [19], and OpenWebMath [47])."
Dataset Splits | Yes | "To assess the disparity in performance between teacher and student models across the ten domains, we have constructed a domain-specific validation set for each domain, where each domain includes 500 samples."
Hardware Specification | Yes | "For the training framework, we employ the DeepSpeed-Chat code as our codebase, and conduct all experiments using 16 NVIDIA A100 GPUs (80G), where FlashAttention V2 [17] is used to accelerate training."
Software Dependencies | No | "For the training framework, we employ the DeepSpeed-Chat code as our codebase, and conduct all experiments using 16 NVIDIA A100 GPUs (80G), where FlashAttention V2 [17] is used to accelerate training."
Experiment Setup | Yes | "For the training schedule, we first apply the warm-up strategy to increase the learning rate from 0 to 3e-5 in 1,000 steps. Then, we use the cosine learning rate schedule, where the final learning rate is 3e-6 and the whole training step is about 30,000 steps. Empirically, we set the distillation interval K as 1,000 and the temperature T as 1.0." (A learning-rate schedule sketch also follows the table.)
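The Pseudocode row above only reproduces the caption of Algorithm 1, so the following Python sketch is one plausible reading of a DDK-style distillation loop, pieced together from the details quoted in the table (ten domains, 500-sample domain validation sets, distillation interval K = 1,000, temperature T = 1.0). All identifiers (`ddk_distill`, `domain_weights`, `mean_val_loss`), the HuggingFace-style `.logits` access, and the softmax re-weighting of domain gaps are illustrative assumptions, not code or notation from the paper's release.

```python
import torch
import torch.nn.functional as F

K = 1_000           # distillation interval from the quoted setup
T = 1.0             # KD softmax temperature from the quoted setup
TOTAL_STEPS = 30_000

@torch.no_grad()
def mean_val_loss(model, val_batches):
    """Mean next-token cross-entropy over one domain's ~500-sample validation set."""
    losses = []
    for input_ids in val_batches:                       # each element: LongTensor [batch, seq]
        logits = model(input_ids).logits                # assumes a HuggingFace-style output
        losses.append(F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1),
        ))
    return torch.stack(losses).mean()

def domain_weights(teacher, student, domain_val_sets):
    """Weight each domain by how far the student lags the teacher on its validation set."""
    gaps = torch.stack([
        (mean_val_loss(student, v) - mean_val_loss(teacher, v)).clamp(min=0.0)
        for v in domain_val_sets
    ])
    return torch.softmax(gaps, dim=0)                   # larger gap -> sampled more often

def ddk_distill(teacher, student, domain_train_iters, domain_val_sets, optimizer, scheduler):
    weights = torch.full((len(domain_train_iters),), 1.0 / len(domain_train_iters))
    for step in range(TOTAL_STEPS):
        if step % K == 0:                               # re-estimate domain gaps every K steps
            weights = domain_weights(teacher, student, domain_val_sets)
        domain = torch.multinomial(weights, 1).item()   # pick a domain for this batch
        input_ids = next(domain_train_iters[domain])    # iterator over that domain's batches
        with torch.no_grad():
            t_logits = teacher(input_ids).logits
        s_logits = student(input_ids).logits
        kd_loss = F.kl_div(                             # temperature-scaled KL to the teacher
            F.log_softmax(s_logits / T, dim=-1),
            F.softmax(t_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T ** 2)
        optimizer.zero_grad()
        kd_loss.backward()
        optimizer.step()
        scheduler.step()
```

The per-batch domain sampling and the plain softmax over loss gaps are stand-ins for whatever sampling and smoothing mechanism the paper actually uses; they only illustrate the overall structure of periodically re-weighting the distillation mixture toward the student's weakest domains.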
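The Experiment Setup row gives enough numbers to reconstruct the learning-rate schedule concretely: linear warm-up from 0 to 3e-5 over the first 1,000 steps, then cosine decay toward a final value of 3e-6 over roughly 30,000 total steps. Below is a minimal sketch assuming a standard PyTorch `LambdaLR` wrapper; the decay horizon and the way the 3e-6 floor is enforced are assumptions, since the quoted text does not specify the implementation.

```python
import math
import torch

PEAK_LR, FINAL_LR = 3e-5, 3e-6
WARMUP_STEPS, TOTAL_STEPS = 1_000, 30_000

def lr_lambda(step: int) -> float:
    """Multiplier applied to the peak learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS                          # linear warm-up: 0 -> 1
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))     # 1 -> 0 over the decay phase
    floor = FINAL_LR / PEAK_LR                              # keeps the LR at or above 3e-6
    return floor + (1.0 - floor) * cosine

# Usage: wrap any optimizer over the student's parameters.
model = torch.nn.Linear(8, 8)                               # stand-in for the student model
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(TOTAL_STEPS):
    optimizer.step()        # forward/backward omitted; shown only to drive the schedule
    scheduler.step()
```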