DDK: Distilling Domain Knowledge for Efficient Large Language Models
Authors: Jiaheng Liu, Chenchen Zhang, Jinyang Guo, Yuanxing Zhang, Haoran Que, Ken Deng, Zhiqi Bai, Jie Liu, Ge Zhang, Jiakai Wang, Yanan Wu, Congnan Liu, Jiamang Wang, Lin Qu, Wenbo Su, Bo Zheng
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations show that DDK significantly improves the performance of student models, outperforming both continuously pretrained baselines and existing knowledge distillation methods by a large margin. |
| Researcher Affiliation | Collaboration | (1) Taobao & Tmall Group of Alibaba; (2) Alibaba Group; (3) The University of Sydney; (4) The Chinese University of Hong Kong; (5) University of Waterloo |
| Pseudocode | Yes | Algorithm 1 Distillation procedure of the DDK framework. |
| Open Source Code | No | All data used in this paper is open-sourced. The code for the baseline methods is also collected from GitHub. |
| Open Datasets | Yes | Due to the unavailability of training data for LLaMA2 and Qwen-1.5 models, we mainly utilize RedPajama [16] for distillation... Moreover, to enhance the model's proficiency in Chinese and Mathematics, we also incorporate three cleaned open-source datasets (i.e., Chinese Books [19], Chinese Common Crawl [19], and OpenWebMath [47]). |
| Dataset Splits | Yes | To assess the disparity in performance between teacher and student models across the ten domains, we have constructed a domain-specific validation set for each domain, with 500 samples per domain. |
| Hardware Specification | Yes | For the training framework, we employ the DeepSpeed-Chat code as our codebase, and conduct all experiments using 16 NVIDIA A100 GPUs (80G), where FlashAttention V2 [17] is used to accelerate training. |
| Software Dependencies | No | For the training framework, we employ the DeepSpeed-Chat code as our codebase, and conduct all experiments using 16 NVIDIA A100 GPUs (80G), where FlashAttention V2 [17] is used to accelerate training. |
| Experiment Setup | Yes | For the training schedule, we first apply the warm-up strategy to increase the learning rate from 0 to 3e-5 over 1,000 steps. Then, we use a cosine learning rate schedule, where the final learning rate is 3e-6 and the total training run is about 30,000 steps. Empirically, we set the distillation interval K to 1,000 and the temperature T to 1.0. |
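The Experiment Setup row fully specifies the learning-rate schedule (linear warm-up from 0 to 3e-5 over 1,000 steps, then cosine decay to a final 3e-6 over roughly 30,000 total steps), so it can be sketched directly. The function name and the exact handling of the decay endpoint below are assumptions for illustration, not the authors' code.

```python
import math

# Minimal sketch of the schedule quoted in the Experiment Setup row:
# linear warm-up 0 -> 3e-5 over 1,000 steps, then cosine decay to 3e-6
# by roughly step 30,000. Names and structure are illustrative assumptions.
WARMUP_STEPS = 1_000
TOTAL_STEPS = 30_000
PEAK_LR = 3e-5
FINAL_LR = 3e-6

def lr_at_step(step: int) -> float:
    """Return the learning rate for a given optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warm-up from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from the peak learning rate down to the final learning rate.
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine

# Example: lr_at_step(500) == 1.5e-5, lr_at_step(30_000) == 3e-6
```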
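The paper's Algorithm 1 (see the Pseudocode row) is not reproduced here, but the quoted settings, ten domain-specific validation sets of 500 samples each, a distillation interval K = 1,000, and a temperature T = 1.0, suggest two building blocks: a temperature-scaled distillation loss, and a step that turns per-domain teacher-student validation gaps into domain sampling proportions refreshed every K steps. The sketch below is a hedged approximation under those assumptions; `domain_sampling_weights`, its softmax mapping, and the toy inputs are illustrative, not the paper's exact updating rule.

```python
import torch
import torch.nn.functional as F

K = 1_000  # distillation interval: how often domain weights would be refreshed
T = 1.0    # softmax temperature used in the distillation loss

def kd_loss(student_logits, teacher_logits, temperature=T):
    """Temperature-scaled KL divergence between teacher and student logits."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def domain_sampling_weights(domain_gaps):
    """Map per-domain teacher-student validation gaps to sampling proportions.

    Larger gaps receive a larger share of the training data; the softmax
    mapping here is an assumption for illustration, not the paper's rule.
    """
    return torch.softmax(torch.as_tensor(domain_gaps, dtype=torch.float), dim=0)

# Toy usage: random logits for a batch of 4 tokens over a 32-token vocabulary,
# and made-up validation gaps for the ten domains.
student_logits = torch.randn(4, 32)
teacher_logits = torch.randn(4, 32)
print(kd_loss(student_logits, teacher_logits))
print(domain_sampling_weights([0.12, 0.05, 0.30, 0.08, 0.02,
                               0.18, 0.07, 0.22, 0.10, 0.04]))
```

In a DDK-style run, the gap measurement on the 500-sample validation sets and the call to `domain_sampling_weights` would be repeated every K = 1,000 training steps to steer which domains the next batches are drawn from.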