MixKD: Towards Efficient Distillation of Large-scale Language Models

Authors: Kevin J Liang, Weituo Hao, Dinghan Shen, Yufan Zhou, Weizhu Chen, Changyou Chen, Lawrence Carin

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To verify its effectiveness, we conduct experiments on the GLUE benchmark, where MixKD consistently leads to significant gains over the standard KD training, and outperforms several competitive baselines. Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
Researcher Affiliation | Collaboration | Kevin J Liang (1, 2), Weituo Hao (1), Dinghan Shen (3), Yufan Zhou (4), Weizhu Chen (3), Changyou Chen (4), Lawrence Carin (1). 1: Duke University; 2: Facebook AI; 3: Microsoft Dynamics 365 AI; 4: State University of New York at Buffalo.
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It refers to third-party codebases such as Hugging Face Transformers and fairseq, but not to the authors' own implementation of MixKD.
Open Datasets | Yes | We conduct experiments on a number of GLUE (Wang et al., 2019) dataset tasks: Stanford Sentiment Treebank (SST-2) (Socher et al., 2013), Microsoft Research Paraphrase Corpus (MRPC) (Dolan & Brockett, 2005), Quora Question Pairs (QQP), Multi-Genre Natural Language Inference (MNLI) (Williams et al., 2018), Question Natural Language Inference (QNLI) (Rajpurkar et al., 2016), and Recognizing Textual Entailment (RTE) (Dagan et al., 2005; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009).
Dataset Splits | Yes | We randomly select 10% and 1% of the data from these datasets to train both the teacher and student models, using the same subset for all experiments for fair comparison. In this data-limited setting, we observe substantial gains from MixKD over the fine-tuned model for QQP (+2.0%, +3.0%), MNLI-m (+3.9%, +3.4%), MNLI-mm (+4.4%, +3.3%), and QNLI (+2.4%, +4.1%) with 10% and 1% of the training data, respectively. We first analyze the contributions of each component of our method, evaluating on the dev set of the GLUE datasets. A data-loading and subsampling sketch is given after the table.
Hardware Specification | Yes | Computation cost comparison of teacher and student models on SST-2 with a batch size of 16 on an Nvidia TITAN X GPU.
Software Dependencies | No | The paper mentions software such as Hugging Face Transformers and fairseq but does not provide specific version numbers for these dependencies, which are required for reproducibility.
Experiment Setup | Yes | We use MSE as the knowledge distillation distance metric d(·, ·). We generate one mixup sample for each original sample in each minibatch (a mixup ratio of 1), with λ ∼ Beta(0.4, 0.4). We set the hyperparameters weighting the components of the loss in Equation 8 to α_SM = α_TMKD = 1. A loss-computation sketch follows the table.
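The Open Datasets and Dataset Splits rows describe training on randomly selected 10% and 1% subsets of each GLUE task, with the same subset reused for the teacher and the student. The sketch below shows one way such subsets could be built; the Hugging Face `datasets` library, the `low_resource_subset` helper, the `sst2` task name, and the fixed seed are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: load a GLUE task and carve out the 10% / 1%
# low-resource training subsets described in the Dataset Splits row.
from datasets import load_dataset  # assumed tooling; the paper does not specify its loader

def low_resource_subset(task_name="sst2", fraction=0.10, seed=42):
    """Return a reproducible `fraction` of a GLUE task's training split."""
    full = load_dataset("glue", task_name, split="train")
    n_keep = max(1, int(fraction * len(full)))
    # Shuffle with a fixed seed so the teacher and student see the same
    # subset, as the paper requires for a fair comparison.
    return full.shuffle(seed=seed).select(range(n_keep))

train_10pct = low_resource_subset("sst2", fraction=0.10)
train_1pct = low_resource_subset("sst2", fraction=0.01)
print(len(train_10pct), len(train_1pct))
```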
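The Experiment Setup row pins down the mixup and distillation hyperparameters: one mixup sample per original sample, λ ∼ Beta(0.4, 0.4), MSE as the distillation distance, and α_SM = α_TMKD = 1. The PyTorch sketch below is a rough illustration of how those settings might be combined in a single training step; the embedding-level interpolation, the model interfaces, and the exact composition of the total loss (Equation 8 in the paper) are assumptions and do not reproduce the authors' code.

```python
# Minimal, assumption-laden sketch of a mixup-based distillation step.
# `student` and `teacher` are assumed to map input embeddings (B, T, D)
# directly to classification logits (B, C); this is not the authors' code.
import torch
import torch.nn.functional as F

def mixkd_step(student, teacher, embeds, labels, alpha_sm=1.0, alpha_tmkd=1.0):
    """One training step on a minibatch of input embeddings and labels."""
    # lambda ~ Beta(0.4, 0.4), as quoted in the Experiment Setup row.
    lam = torch.distributions.Beta(0.4, 0.4).sample().to(embeds.device)
    perm = torch.randperm(embeds.size(0), device=embeds.device)

    # One mixup sample per original sample: interpolate each example with
    # a randomly paired example from the same minibatch.
    mixed_embeds = lam * embeds + (1.0 - lam) * embeds[perm]

    student_logits = student(embeds)
    student_mix_logits = student(mixed_embeds)
    with torch.no_grad():
        teacher_logits = teacher(embeds)
        teacher_mix_logits = teacher(mixed_embeds)

    # Supervised loss on original data plus interpolated labels on mixup data.
    ce = F.cross_entropy(student_logits, labels)
    ce_mix = lam * F.cross_entropy(student_mix_logits, labels) \
        + (1.0 - lam) * F.cross_entropy(student_mix_logits, labels[perm])

    # MSE as the distillation distance d(., .), on original and mixup samples.
    kd = F.mse_loss(student_logits, teacher_logits)
    kd_mix = F.mse_loss(student_mix_logits, teacher_mix_logits)

    # Indicative total loss with unit weights on the mixup terms; the true
    # weighting scheme is given by Equation 8 of the paper.
    return ce + kd + alpha_sm * ce_mix + alpha_tmkd * kd_mix
```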