Sentence-Level or Token-Level? A Comprehensive Study on Knowledge Distillation

Authors: Jingxuan Wei, Linzhuang Sun, Yichong Leng, Xu Tan, Bihui Yu, Ruifeng Guo

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | While our experimental results validate our hypothesis, defining the complexity level of a given scenario remains a challenging task. We therefore introduce a novel hybrid method that combines token-level and sentence-level distillation through a gating mechanism, aiming to leverage the advantages of both individual methods. Experiments show that the hybrid method surpasses both token-level and sentence-level distillation as well as previous works by a clear margin, demonstrating its effectiveness.
Researcher Affiliation | Academia | Jingxuan Wei (1,2), Linzhuang Sun (1,2), Yichong Leng (3), Xu Tan (4), Bihui Yu (1,2), Ruifeng Guo (1,2). 1: Shenyang Institute of Computing Technology, Chinese Academy of Sciences; 2: University of Chinese Academy of Sciences; 3: University of Science and Technology of China; 4: Independent Researcher. Contact: weijingxuan20@mails.ucas.edu.cn, tanxu2012@gmail.com
Pseudocode | No | Our hybrid method features a gate-controlled mechanism that dynamically balances the contributions of token-level and sentence-level distillation. This mechanism, denoted G and illustrated in Figure 1, is represented by the function g(x) for each input sequence x, modulating the influence of each distillation strategy during training to suit different translation scenarios. (A minimal sketch of such a gate appears after this table.)
Open Source Code | No | The experiments are conducted using the Fairseq framework [2]. [1] https://github.com/rsennrich/subword-nmt [2] https://github.com/facebookresearch/fairseq
Open Datasets | Yes | For the experiments, we select four datasets to cover a range of complexities and linguistic characteristics: IWSLT13 English→French (en→fr), IWSLT14 German→English (de→en), WMT14 English→German (en→de), and IWSLT17 Arabic→English (ar→en). Each dataset offers a unique combination of bilingual sentence pairs and complexity levels: 200k for IWSLT13 en→fr, 153k for IWSLT14 de→en, 4.5M for WMT14 en→de, and 231k for IWSLT17 ar→en.
Dataset Splits | No | Our experiments are conducted on four NVIDIA 3090 GPUs, each with a batch size of 3000. Gradients accumulate over four iterations per update. The learning rate is set at 5×10⁻⁴, using the Adam optimizer with an inverse-sqrt learning rate scheduler. For inference, we employ a beam search with a width of 4 and a length penalty of 0.6.
Hardware Specification | Yes | Our experiments are conducted on four NVIDIA 3090 GPUs, each with a batch size of 3000.
Software Dependencies | No | We apply byte-pair encoding (BPE) with the subword-nmt toolkit [1] to all sentences in these datasets for tokenization. The experiments are conducted using the Fairseq framework [2]. (A BPE preprocessing example appears after this table.)
Experiment Setup | Yes | Our experiments are conducted on four NVIDIA 3090 GPUs, each with a batch size of 3000. Gradients accumulate over four iterations per update. The learning rate is set at 5×10⁻⁴, using the Adam optimizer with an inverse-sqrt learning rate scheduler. For inference, we employ a beam search with a width of 4 and a length penalty of 0.6. (An optimizer and schedule sketch appears after this table.)
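
The gate-controlled hybrid described in the Pseudocode row can be pictured with a short sketch. This is not the authors' implementation: the class name, the mean-pooled encoder gate, the KL divergence used for the token-level term, and the cross-entropy against the teacher's decoded sequence used for the sentence-level term are all assumptions made for illustration. Only the idea of a per-sequence gate g(x) that mixes the two distillation losses comes from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedHybridDistillationLoss(nn.Module):
    """Sketch of a gate-controlled mix of token-level and sentence-level
    distillation. Names and loss choices are illustrative, not the paper's code."""

    def __init__(self, d_model: int, temperature: float = 1.0):
        super().__init__()
        # Gate g(x): maps a pooled representation of the source sentence to (0, 1).
        self.gate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
        self.temperature = temperature

    def forward(self, enc_out, student_logits, teacher_logits, teacher_seq_targets, pad_mask):
        # enc_out:             (B, S, d_model) encoder states of the source x
        # student_logits:      (B, T, V) student predictions
        # teacher_logits:      (B, T, V) teacher distributions (token-level signal)
        # teacher_seq_targets: (B, T)    teacher's decoded output (sentence-level signal)
        # pad_mask:            (B, T)    True at padding positions
        g = self.gate(enc_out.mean(dim=1)).squeeze(1)           # g(x), shape (B,)

        # Token-level distillation: KL between teacher and student distributions.
        t = self.temperature
        kl = F.kl_div(
            F.log_softmax(student_logits / t, dim=-1),
            F.softmax(teacher_logits / t, dim=-1),
            reduction="none",
        ).sum(-1)                                               # (B, T)
        token_loss = kl.masked_fill(pad_mask, 0.0).sum(1) / (~pad_mask).sum(1)

        # Sentence-level distillation: cross-entropy against the teacher's sequence.
        ce = F.cross_entropy(
            student_logits.transpose(1, 2), teacher_seq_targets, reduction="none"
        )                                                       # (B, T)
        sent_loss = ce.masked_fill(pad_mask, 0.0).sum(1) / (~pad_mask).sum(1)

        # The gate mixes the two objectives per input sequence.
        return (g * token_loss + (1.0 - g) * sent_loss).mean()

if __name__ == "__main__":
    B, S, T, V, d = 2, 7, 5, 100, 512
    loss_fn = GatedHybridDistillationLoss(d_model=d)
    loss = loss_fn(
        enc_out=torch.randn(B, S, d),
        student_logits=torch.randn(B, T, V),
        teacher_logits=torch.randn(B, T, V),
        teacher_seq_targets=torch.randint(0, V, (B, T)),
        pad_mask=torch.zeros(B, T, dtype=torch.bool),
    )
    loss.backward()
```

A single scalar gate per source sentence is the simplest reading of g(x); a per-token gate would be a straightforward variant of the same idea.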
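
The Software Dependencies row mentions BPE tokenization with the subword-nmt toolkit. The snippet below is a minimal sketch of learning and applying BPE codes through the toolkit's Python interface; the file names, the toy corpus, and the 10k merge-operation count are assumptions for illustration, since the quoted excerpt does not state them.

```python
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Toy corpus so the snippet runs end-to-end; replace with the real training text.
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("knowledge distillation for neural machine translation\n" * 100)

# Learn BPE merge operations (10k is an assumed value; learning stops early
# on this tiny corpus once no pair is frequent enough).
with open("train.txt", encoding="utf-8") as infile, \
     open("codes.bpe", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=10000)

# Apply the learned codes to segment sentences into subword units.
with open("codes.bpe", encoding="utf-8") as codes:
    bpe = BPE(codes)

print(bpe.process_line("a sentence to be segmented into subword units"))
```

In practice the same steps are usually run via the toolkit's command-line entry points before feeding the data to Fairseq's preprocessing.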
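
The Experiment Setup row quotes a peak learning rate of 5×10⁻⁴, the Adam optimizer, an inverse-sqrt schedule, and gradient accumulation over four iterations. The sketch below reproduces just that optimization recipe in plain PyTorch under stated assumptions: the warmup length (4000 steps) and the stand-in model are placeholders not given in the excerpt, and the paper runs this through Fairseq rather than a hand-rolled loop.

```python
import torch
import torch.nn as nn

# Values quoted in the Experiment Setup row; warmup_steps is an assumption.
peak_lr = 5e-4
warmup_steps = 4000
accum_steps = 4                      # gradients accumulate over four iterations

model = nn.Linear(512, 512)          # stand-in for the student Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr)

def inverse_sqrt(step: int) -> float:
    """Warm up linearly to the peak learning rate, then decay as 1/sqrt(step)."""
    step = max(step, 1)
    if step < warmup_steps:
        return step / warmup_steps
    return (warmup_steps / step) ** 0.5

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)

# Gradient accumulation: update parameters every `accum_steps` mini-batches.
for i in range(8):                                   # toy training steps
    loss = model(torch.randn(8, 512)).pow(2).mean()  # placeholder loss
    (loss / accum_steps).backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```

The remaining quoted settings (beam width 4, length penalty 0.6) concern beam-search decoding at inference time and are not part of this training sketch.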