Sentence-Level or Token-Level? A Comprehensive Study on Knowledge Distillation
Authors: Jingxuan Wei, Linzhuang Sun, Yichong Leng, Xu Tan, Bihui Yu, Ruifeng Guo
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | While our experimental results validate our hypothesis, defining the complexity level of a given scenario remains a challenging task. We therefore introduce a novel hybrid method that combines token-level and sentence-level distillation through a gating mechanism, aiming to leverage the advantages of both individual methods. Experiments show that the hybrid method outperforms both the individual token-level and sentence-level distillation methods and previous works, demonstrating its effectiveness. |
| Researcher Affiliation | Academia | Jingxuan Wei (1,2), Linzhuang Sun (1,2), Yichong Leng (3), Xu Tan (4), Bihui Yu (1,2), Ruifeng Guo (1,2). 1: Shenyang Institute of Computing Technology, Chinese Academy of Sciences; 2: University of Chinese Academy of Sciences; 3: University of Science and Technology of China; 4: Independent Researcher. weijingxuan20@mails.ucas.edu.cn, tanxu2012@gmail.com |
| Pseudocode | No | Our hybrid method features a gate-controlled mechanism, dynamically balancing the contributions of token-level and sentence-level distillation. This mechanism, denoted as G and illustrated in Figure 1, is represented by the function g(x) for each input sequence x, modulating the influence of each distillation strategy during training to suit different translation scenarios. (A code sketch of such a gating mechanism follows the table.) |
| Open Source Code | No | The experiments are conducted using the Fairseq framework [2]. [1] https://github.com/rsennrich/subword-nmt [2] https://github.com/facebookresearch/fairseq |
| Open Datasets | Yes | For the experiments, we select four datasets to cover a range of complexities and linguistic characteristics: IWSLT13 English-French (en-fr), IWSLT14 German-English (de-en), WMT14 English-German (en-de), and IWSLT17 Arabic-English (ar-en). Each dataset offers a unique combination of bilingual sentence pairs and complexity levels: 200k for IWSLT13 en-fr, 153k for IWSLT14 de-en, 4.5M for WMT14 en-de, and 231k for IWSLT17 ar-en. |
| Dataset Splits | No | Our experiments are conducted on four NVIDIA 3090 GPUs, each with a batch size of 3000. Gradients accumulate over four iterations per update. The learning rate is set at 5×10⁻⁴, using the Adam optimizer with an inverse-sqrt learning rate scheduler. For inference, we employ a beam search with a width of 4 and a length penalty of 0.6. |
| Hardware Specification | Yes | Our experiments are conducted on four NVIDIA 3090 GPUs, each with a batch size of 3000. |
| Software Dependencies | No | We apply byte-pair encoding (BPE) with the subword-nmt toolkit [1] to all sentences in these datasets for tokenization. The experiments are conducted using the Fairseq framework [2]. (A BPE preprocessing sketch follows the table.) |
| Experiment Setup | Yes | Our experiments are conducted on four NVIDIA 3090 GPUs, each with a batch size of 3000. Gradients accumulate over four iterations per update. The learning rate is set at 5×10⁻⁴, using the Adam optimizer with an inverse-sqrt learning rate scheduler. For inference, we employ a beam search with a width of 4 and a length penalty of 0.6. (A training-configuration sketch follows the table.) |
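The paper releases neither code nor pseudocode for the gate-controlled mechanism, so the following is only a minimal PyTorch sketch of how a per-sequence gate g(x) could blend token-level and sentence-level distillation losses. The class name `GatedHybridKD`, the mean-pooled gate input, and the single linear-plus-sigmoid gate are assumptions made for illustration; only the idea of a sigmoid gate weighting the two losses comes from the paper.

```python
import torch
import torch.nn.functional as F


class GatedHybridKD(torch.nn.Module):
    """Sketch of a gate-controlled hybrid of token- and sentence-level distillation.

    One gate value g(x) per source sequence weights the token-level KD term
    against the sentence-level KD term. Names and shapes are illustrative only.
    """

    def __init__(self, d_model: int):
        super().__init__()
        # Gate g(x): pooled source encoding -> scalar in (0, 1).
        self.gate = torch.nn.Sequential(torch.nn.Linear(d_model, 1), torch.nn.Sigmoid())

    def forward(self, src_encoding, student_logits, teacher_logits, teacher_tokens):
        # src_encoding:   (B, S, d_model) encoder states of the source sentence x
        # student_logits: (B, T, V) student predictions, teacher-forced on the teacher output
        # teacher_logits: (B, T, V) teacher predictions at the same target positions
        # teacher_tokens: (B, T)    teacher beam-search output, used as sentence-level target
        g = self.gate(src_encoding.mean(dim=1))  # (B, 1), one gate value per sequence

        # Token-level KD: KL divergence between teacher and student token distributions.
        token_kd = F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="none",
        ).sum(dim=-1).mean(dim=-1, keepdim=True)  # (B, 1)

        # Sentence-level KD: cross-entropy against the teacher-generated sequence.
        sent_kd = F.cross_entropy(
            student_logits.transpose(1, 2),  # (B, V, T)
            teacher_tokens,
            reduction="none",
        ).mean(dim=-1, keepdim=True)  # (B, 1)

        # g(x) modulates the contribution of each distillation strategy per sequence.
        return (g * token_kd + (1.0 - g) * sent_kd).mean()
```

In this sketch the student is assumed to be teacher-forced on the teacher's beam output, so both loss terms are computed over the same positions; the resulting scalar loss is backpropagated through the student and the gate jointly.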
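The paper states only that the subword-nmt toolkit is used for BPE tokenization; it does not report the number of merge operations. The snippet below is a sketch assuming subword-nmt's Python API (`learn_bpe` / `apply_bpe.BPE`); the 10k merges and the file names are placeholders, not values from the paper.

```python
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn BPE merges on the training corpus (10k merges is an assumption;
# the paper does not report the number of merge operations).
with open("train.en-de.txt", encoding="utf-8") as infile, \
        open("codes.bpe", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=10000)

# Apply the learned codes to every sentence before Fairseq preprocessing.
with open("codes.bpe", encoding="utf-8") as codes:
    bpe = BPE(codes)

print(bpe.process_line("knowledge distillation for machine translation"))
```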
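The reported optimizer settings (Adam, peak learning rate 5×10⁻⁴, inverse-sqrt schedule, gradient accumulation over four iterations) can be mirrored outside Fairseq in a few lines of PyTorch. The warmup length and the toy model below are assumptions, since the paper does not state them.

```python
import torch

model = torch.nn.Linear(512, 512)   # placeholder for the student Transformer
peak_lr = 5e-4                      # learning rate reported in the paper
warmup_updates = 4000               # assumption: common Fairseq default, not stated in the paper
accumulation_steps = 4              # gradients accumulate over four iterations per update

optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr)


def inverse_sqrt(step: int) -> float:
    """Linear warmup, then decay proportional to 1/sqrt(step) (Fairseq's inverse_sqrt)."""
    step = max(step, 1)
    if step < warmup_updates:
        return step / warmup_updates
    return (warmup_updates / step) ** 0.5


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)

# Gradient accumulation: step the optimizer once every `accumulation_steps` batches.
for i, (x, y) in enumerate([(torch.randn(8, 512), torch.randn(8, 512))] * 8):
    loss = torch.nn.functional.mse_loss(model(x), y) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()
```

At inference time, the beam width of 4 and length penalty of 0.6 correspond to Fairseq's `--beam 4 --lenpen 0.6` generation flags.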