DDK: Distilling Domain Knowledge for Efficient Large Language Models
Authors: Jiaheng Liu, Chenchen Zhang, Jinyang Guo, Yuanxing Zhang, Haoran Que, Ken Deng, Zhiqi Bai, Jie Liu, Ge Zhang, Jiakai Wang, Yanan Wu, Congnan Liu, Jiamang Wang, Lin Qu, Wenbo Su, Bo Zheng
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations show that DDK significantly improves the performance of student models, outperforming both continuously pretrained baselines and existing knowledge distillation methods by a large margin. |
| Researcher Affiliation | Collaboration | (1) Taobao & Tmall Group of Alibaba; (2) Alibaba Group; (3) The University of Sydney; (4) The Chinese University of Hong Kong; (5) University of Waterloo |
| Pseudocode | Yes | Algorithm 1 Distillation procedure of the DDK framework. |
| Open Source Code | No | All data used in this paper is open-sourced. The code for the baseline methods is also collected from GitHub. |
| Open Datasets | Yes | Due to the unavailability of training data for LLaMA2 and Qwen-1.5 models, we mainly utilize RedPajama [16] for distillation... Moreover, to enhance the model's proficiency in Chinese and Mathematics, we also incorporate three cleaned open-source datasets (i.e., Chinese Books [19], Chinese Common Crawl [19], and OpenWebMath [47]). |
| Dataset Splits | Yes | To assess the disparity in performance between teacher and student models across the ten domains, we have constructed a domain-specific validation set for each domain, with 500 samples per domain. |
| Hardware Specification | Yes | For the training framework, we employ the DeepSpeed-Chat code as our codebase, and conduct all experiments using 16 NVIDIA A100 GPUs (80G), where FlashAttention V2 [17] is used to accelerate training. |
| Software Dependencies | No | For the training framework, we employ the DeepSpeed-Chat code as our codebase, and conduct all experiments using 16 NVIDIA A100 GPUs (80G), where FlashAttention V2 [17] is used to accelerate training. |
| Experiment Setup | Yes | For the training schedule, we first apply the warm-up strategy to increase the learning rate from 0 to 3e-5 over 1,000 steps. Then, we use a cosine learning rate schedule, where the final learning rate is 3e-6 and the total training run is about 30,000 steps. Empirically, we set the distillation interval K to 1,000 and the temperature T to 1.0. |
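The Experiment Setup row fully specifies the learning-rate schedule (linear warm-up from 0 to 3e-5 over 1,000 steps, then cosine decay to a final 3e-6 over roughly 30,000 total steps), so it can be sketched directly. The function name and the exact handling of the decay endpoint below are assumptions for illustration, not the authors' code.

```python
import math

# Minimal sketch of the schedule quoted in the Experiment Setup row:
# linear warm-up 0 -> 3e-5 over 1,000 steps, then cosine decay to 3e-6
# by roughly step 30,000. Names and structure are illustrative assumptions.
WARMUP_STEPS = 1_000
TOTAL_STEPS = 30_000
PEAK_LR = 3e-5
FINAL_LR = 3e-6

def lr_at_step(step: int) -> float:
    """Return the learning rate for a given optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warm-up from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from the peak learning rate down to the final learning rate.
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine

# Example: lr_at_step(500) == 1.5e-5, lr_at_step(30_000) == 3e-6
```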
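The paper's Algorithm 1 (see the Pseudocode row) is not reproduced here, but the quoted settings, ten domain-specific validation sets of 500 samples each, a distillation interval K = 1,000, and a temperature T = 1.0, suggest two building blocks: a temperature-scaled distillation loss, and a step that turns per-domain teacher-student validation gaps into domain sampling proportions refreshed every K steps. The sketch below is a hedged approximation under those assumptions; `domain_sampling_weights`, its softmax mapping, and the toy inputs are illustrative, not the paper's exact updating rule.

```python
import torch
import torch.nn.functional as F

K = 1_000  # distillation interval: how often domain weights would be refreshed
T = 1.0    # softmax temperature used in the distillation loss

def kd_loss(student_logits, teacher_logits, temperature=T):
    """Temperature-scaled KL divergence between teacher and student logits."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def domain_sampling_weights(domain_gaps):
    """Map per-domain teacher-student validation gaps to sampling proportions.

    Larger gaps receive a larger share of the training data; the softmax
    mapping here is an assumption for illustration, not the paper's rule.
    """
    return torch.softmax(torch.as_tensor(domain_gaps, dtype=torch.float), dim=0)

# Toy usage: random logits for a batch of 4 tokens over a 32-token vocabulary,
# and made-up validation gaps for the ten domains.
student_logits = torch.randn(4, 32)
teacher_logits = torch.randn(4, 32)
print(kd_loss(student_logits, teacher_logits))
print(domain_sampling_weights([0.12, 0.05, 0.30, 0.08, 0.02,
                               0.18, 0.07, 0.22, 0.10, 0.04]))
```

In a DDK-style run, the gap measurement on the 500-sample validation sets and the call to `domain_sampling_weights` would be repeated every K = 1,000 training steps to steer which domains the next batches are drawn from.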