Knowledge Distillation from Internal Representations
Authors: Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, Chenlei Guo
AAAI 2020, pp. 7350-7357 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment with datasets from the GLUE benchmark and consistently show that adding knowledge distillation from internal representations is a more powerful method than only using soft-label distillation. [...] Experiments and Results Datasets We conduct experiments on four datasets of the GLUE benchmark (Wang et al. 2018) |
| Researcher Affiliation | Collaboration | Gustavo Aguilar (1), Yuan Ling (2), Yu Zhang (2), Benjamin Yao (2), Xing Fan (2), Chenlei Guo (2); (1) Department of Computer Science, University of Houston, Houston, USA; (2) Alexa AI, Amazon, Seattle, USA; gaguilaralas@uh.edu, {yualing, yzzhan, banjamy, fanxing, guochenl}@amazon.com |
| Pseudocode | Yes (illustrative sketch after the table) | Algorithm 1 Stacked Internal Distillation (SID) |
| Open Source Code | No | The paper does not provide explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes (loading example after the table) | We conduct experiments on four datasets of the GLUE benchmark (Wang et al. 2018), which we describe briefly: 1. CoLA. The Corpus of Linguistic Acceptability (Warstadt, Singh, and Bowman 2018) [...] 2. QQP. The Quora Question Pairs is a semantic similarity dataset, where the task is to determine whether two questions are semantically equivalent or not. [...] data.quora.com/First-Quora-Dataset-Release-Question-Pairs [...] 3. MRPC. The Microsoft Research Paraphrase Corpus (Dolan and Brockett 2005) [...] 4. RTE. The Recognizing Textual Entailment (Wang et al. 2018) |
| Dataset Splits | Yes | We conduct experiments on four datasets of the GLUE benchmark (Wang et al. 2018) [...] Table 1 shows the results on the development set across four datasets. [...] The test results from the best models according to the development set. |
| Hardware Specification | No | The paper discusses computational limitations but does not provide specific details about the hardware used for experiments. |
| Software Dependencies | No | The paper mentions optimizers and models (e.g., Adam, BERT) but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes (configuration sketch after the table) | We optimize our models using Adam with an initial learning rate of 2e-5 and a learning rate scheduler as described by Devlin et al. (2018). We fine-tune BERT-base for 10 epochs, and the simplified BERT models for 50 epochs, both with a batch size of 32 samples and a maximum sequence length of 64 tokens. |
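
The Pseudocode row above refers to the paper's Algorithm 1, Stacked Internal Distillation (SID). As a rough, non-authoritative illustration of the kind of objective such a procedure optimizes, the following is a minimal PyTorch sketch that combines hard-label cross-entropy, soft-label distillation, and a KL-divergence match between teacher and student self-attention distributions for a few mapped layers. The function names, layer map, and loss weighting are illustrative assumptions and do not reproduce the paper's exact formulation or its stacking schedule.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between temperature-softened teacher and student class distributions.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

def attention_kl_loss(student_attn, teacher_attn, eps=1e-12):
    # KL divergence between attention probability matrices of one mapped layer pair.
    # Expected shape: (batch, heads, seq_len, seq_len), rows already softmax-normalized.
    return F.kl_div(torch.log(student_attn + eps), teacher_attn, reduction="batchmean")

def distillation_loss(student_out, teacher_out, labels, layer_map, alpha=0.5):
    # Hard-label cross-entropy + soft-label distillation + internal (attention) distillation.
    # `student_out` / `teacher_out` are assumed to expose `.logits` and `.attentions`
    # (e.g. Hugging Face model outputs with output_attentions=True).
    loss = alpha * F.cross_entropy(student_out.logits, labels)
    loss = loss + (1.0 - alpha) * soft_label_loss(student_out.logits, teacher_out.logits)
    for s_layer, t_layer in layer_map:   # e.g. [(0, 5), (1, 11)] maps student -> teacher layers
        loss = loss + attention_kl_loss(student_out.attentions[s_layer],
                                        teacher_out.attentions[t_layer])
    return loss
```

The "stacked" part of Algorithm 1 (the schedule that decides when each layer's internal loss enters training) is not reproduced in this sketch.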
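
The Open Datasets row lists the four GLUE tasks used (CoLA, QQP, MRPC, RTE). As a convenience, and not something the paper itself mentions, the same corpora and their official splits can be obtained with the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# GLUE task names as exposed by the `datasets` library; the paper predates this loader,
# so this is only a convenient way to fetch the same benchmark data.
for task in ("cola", "qqp", "mrpc", "rte"):
    ds = load_dataset("glue", task)
    print(task, {split: len(ds[split]) for split in ds})  # train / validation / test sizes
```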
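
The Experiment Setup row reports the optimizer, learning rate, epoch counts, batch size, and maximum sequence length. The sketch below wires those reported values into a standard PyTorch/Transformers fine-tuning configuration; the checkpoint name, warmup proportion, and steps-per-epoch placeholder are assumptions, since the paper does not report software versions or hardware.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          get_linear_schedule_with_warmup)

# Hyperparameters reported in the paper.
LR, BATCH_SIZE, MAX_LEN = 2e-5, 32, 64
EPOCHS_TEACHER, EPOCHS_STUDENT = 10, 50      # BERT-base vs. simplified student models

# Assumed checkpoint; the paper fine-tunes BERT-base.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def encode(texts):
    # Truncate/pad to the reported maximum sequence length of 64 tokens.
    return tokenizer(texts, truncation=True, padding="max_length",
                     max_length=MAX_LEN, return_tensors="pt")

steps_per_epoch = 1000                       # placeholder: len(train_loader) at batch size 32
total_steps = EPOCHS_TEACHER * steps_per_epoch

optimizer = torch.optim.Adam(model.parameters(), lr=LR)
# Linear warmup/decay in the spirit of the schedule described by Devlin et al. (2018);
# the 10% warmup proportion is an assumption, not a value reported in the paper.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=int(0.1 * total_steps),
                                            num_training_steps=total_steps)
```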