Knowledge Distillation from Internal Representations

Authors: Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, Chenlei Guo

AAAI 2020, pp. 7350-7357

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment with datasets from the GLUE benchmark and consistently show that adding knowledge distillation from internal representations is a more powerful method than only using soft-label distillation. [...] Experiments and Results: Datasets. We conduct experiments on four datasets of the GLUE benchmark (Wang et al. 2018).
Researcher Affiliation | Collaboration | Gustavo Aguilar (1), Yuan Ling (2), Yu Zhang (2), Benjamin Yao (2), Xing Fan (2), Chenlei Guo (2). (1) Department of Computer Science, University of Houston, Houston, USA; (2) Alexa AI, Amazon, Seattle, USA. gaguilaralas@uh.edu, {yualing, yzzhan, banjamy, fanxing, guochenl}@amazon.com
Pseudocode | Yes | Algorithm 1: Stacked Internal Distillation (SID). (A hedged sketch of a combined soft-label and internal distillation loss follows the table.)
Open Source Code | No | The paper does not provide explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We conduct experiments on four datasets of the GLUE benchmark (Wang et al. 2018), which we describe briefly: 1. CoLA. The Corpus of Linguistic Acceptability (Warstadt, Singh, and Bowman 2018) [...] 2. QQP. The Quora Question Pairs (data.quora.com/First-Quora-Dataset-Release-Question-Pairs) is a semantic similarity dataset, where the task is to determine whether two questions are semantically equivalent or not. [...] 3. MRPC. The Microsoft Research Paraphrase Corpus (Dolan and Brockett 2005) [...] 4. RTE. The Recognizing Textual Entailment (Wang et al. 2018). (A dataset-loading example follows the table.)
Dataset Splits | Yes | We conduct experiments on four datasets of the GLUE benchmark (Wang et al. 2018) [...] Table 1 shows the results on the development set across four datasets. [...] The test results are from the best models according to the development set.
Hardware Specification | No | The paper discusses computational limitations but does not provide specific details about the hardware used for experiments.
Software Dependencies | No | The paper mentions optimizers and models (e.g., Adam, BERT) but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | We optimize our models using Adam with an initial learning rate of 2e-5 and a learning rate scheduler as described by Devlin et al. (2018). We fine-tune BERT-base for 10 epochs, and the simplified BERT models for 50 epochs, both with a batch size of 32 samples and a maximum sequence length of 64 tokens. (A fine-tuning sketch with these hyperparameters follows the table.)
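
The Research Type and Pseudocode rows refer to distilling internal representations (here, self-attention distributions) alongside the usual soft labels. Below is a minimal sketch of such a combined objective, assuming a KL-divergence term between teacher and student self-attention matrices plus a temperature-scaled soft-label term; the function names, layer pairing, and loss weights are illustrative assumptions, not a reproduction of the authors' Algorithm 1 (Stacked Internal Distillation).

```python
# Minimal sketch: soft-label distillation combined with an internal
# (self-attention) distillation term. All names, the layer pairing, and the
# alpha weighting are illustrative assumptions, not the paper's reference code.
import torch
import torch.nn.functional as F


def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-scaled teacher and student predictions."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)


def attention_distillation_loss(student_attn, teacher_attn, eps=1e-12):
    """KL divergence between teacher and student self-attention distributions.

    Both tensors have shape (batch, heads, seq_len, seq_len) and each row is a
    probability distribution over sequence positions.
    """
    return F.kl_div((student_attn + eps).log(), teacher_attn, reduction="batchmean")


def distillation_loss(student_out, teacher_out, labels, alpha=0.5, temperature=2.0):
    """Supervised cross-entropy + soft-label KD + internal (attention) KD."""
    ce = F.cross_entropy(student_out["logits"], labels)
    kd = soft_label_loss(student_out["logits"], teacher_out["logits"], temperature)
    internal = attention_distillation_loss(
        student_out["attentions"][-1],   # top student layer ...
        teacher_out["attentions"][-1],   # ... matched to a chosen teacher layer
    )
    return ce + alpha * kd + (1.0 - alpha) * internal
```

The single teacher-student layer pairing shown here is only the simplest case; the paper's stacked procedure schedules internal matching across layers during training.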
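
The four GLUE tasks cited under Open Datasets (CoLA, QQP, MRPC, RTE) are publicly available. As one possible way to obtain them, the sketch below uses the Hugging Face `datasets` library; the paper does not state which data tooling was actually used.

```python
# Hedged example: fetching the four GLUE tasks named above with the Hugging
# Face `datasets` library (an assumption; the paper does not specify tooling).
from datasets import load_dataset

tasks = ["cola", "qqp", "mrpc", "rte"]
glue = {task: load_dataset("glue", task) for task in tasks}

for task, splits in glue.items():
    # GLUE test labels are withheld, so model selection is done on the
    # validation (development) split, consistent with the Dataset Splits row.
    print(task, {name: len(ds) for name, ds in splits.items()})
```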
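
The Experiment Setup row pins down the optimizer, learning rate, schedule, epoch counts, batch size, and maximum sequence length. The sketch below wires those reported values together using the `transformers` library; the choice of library, the use of AdamW as the Adam variant, and the warmup and step counts are assumptions made for illustration only.

```python
# Hedged sketch of the reported fine-tuning setup: lr 2e-5, a linear schedule
# in the style of Devlin et al. (2018), batch size 32, max length 64, and
# 10 epochs for BERT-base. Library choice and step counts are assumptions.
import torch
from transformers import (
    BertForSequenceClassification,
    BertTokenizerFast,
    get_linear_schedule_with_warmup,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

epochs, batch_size, max_len = 10, 32, 64
steps_per_epoch = 1000           # placeholder; compute from the actual training-set size
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # AdamW stands in for BERT-style Adam
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,          # warmup size is not reported; 0 is a placeholder
    num_training_steps=epochs * steps_per_epoch,
)

# Single illustrative training step on a toy batch.
batch = tokenizer(
    ["The book was read by the student."],
    padding="max_length", truncation=True, max_length=max_len, return_tensors="pt",
)
outputs = model(**batch, labels=torch.tensor([1]))
outputs.loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```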