Knowledge Distillation from Internal Representations

Authors: Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, Chenlei Guo

AAAI 2020, pp. 7350-7357

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment with datasets from the GLUE benchmark and consistently show that adding knowledge distillation from internal representations is a more powerful method than only using soft-label distillation. [...] Experiments and Results: Datasets. We conduct experiments on four datasets of the GLUE benchmark (Wang et al. 2018).
Researcher Affiliation | Collaboration | Gustavo Aguilar (1), Yuan Ling (2), Yu Zhang (2), Benjamin Yao (2), Xing Fan (2), Chenlei Guo (2). (1) Department of Computer Science, University of Houston, Houston, USA; (2) Alexa AI, Amazon, Seattle, USA. gaguilaralas@uh.edu, {yualing, yzzhan, banjamy, fanxing, guochenl}@amazon.com
Pseudocode | Yes | Algorithm 1: Stacked Internal Distillation (SID). (A hedged sketch of a combined soft-label and internal distillation loss follows the table.)
Open Source Code | No | The paper does not provide explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We conduct experiments on four datasets of the GLUE benchmark (Wang et al. 2018), which we describe briefly: 1. CoLA. The Corpus of Linguistic Acceptability (Warstadt, Singh, and Bowman 2018) [...] 2. QQP. The Quora Question Pairs (data.quora.com/First-Quora-Dataset-Release-Question-Pairs) is a semantic similarity dataset, where the task is to determine whether two questions are semantically equivalent or not. [...] 3. MRPC. The Microsoft Research Paraphrase Corpus (Dolan and Brockett 2005) [...] 4. RTE. The Recognizing Textual Entailment (Wang et al. 2018). (A dataset-loading example follows the table.)
Dataset Splits | Yes | We conduct experiments on four datasets of the GLUE benchmark (Wang et al. 2018) [...] Table 1 shows the results on the development set across four datasets. [...] The test results are from the best models according to the development set.
Hardware Specification | No | The paper discusses computational limitations but does not provide specific details about the hardware used for experiments.
Software Dependencies | No | The paper mentions optimizers and models (e.g., Adam, BERT) but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | We optimize our models using Adam with an initial learning rate of 2e-5 and a learning rate scheduler as described by Devlin et al. (2018). We fine-tune BERT-base for 10 epochs, and the simplified BERT models for 50 epochs, both with a batch size of 32 samples and a maximum sequence length of 64 tokens. (A fine-tuning sketch with these hyperparameters follows the table.)
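
The Research Type and Pseudocode rows refer to distilling internal representations (here, self-attention distributions) alongside the usual soft labels. Below is a minimal sketch of such a combined objective, assuming a KL-divergence term between teacher and student self-attention matrices plus a temperature-scaled soft-label term; the function names, layer pairing, and loss weights are illustrative assumptions, not a reproduction of the authors' Algorithm 1 (Stacked Internal Distillation).

```python
# Minimal sketch: soft-label distillation combined with an internal
# (self-attention) distillation term. All names, the layer pairing, and the
# alpha weighting are illustrative assumptions, not the paper's reference code.
import torch
import torch.nn.functional as F


def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-scaled teacher and student predictions."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)


def attention_distillation_loss(student_attn, teacher_attn, eps=1e-12):
    """KL divergence between teacher and student self-attention distributions.

    Both tensors have shape (batch, heads, seq_len, seq_len) and each row is a
    probability distribution over sequence positions.
    """
    return F.kl_div((student_attn + eps).log(), teacher_attn, reduction="batchmean")


def distillation_loss(student_out, teacher_out, labels, alpha=0.5, temperature=2.0):
    """Supervised cross-entropy + soft-label KD + internal (attention) KD."""
    ce = F.cross_entropy(student_out["logits"], labels)
    kd = soft_label_loss(student_out["logits"], teacher_out["logits"], temperature)
    internal = attention_distillation_loss(
        student_out["attentions"][-1],   # top student layer ...
        teacher_out["attentions"][-1],   # ... matched to a chosen teacher layer
    )
    return ce + alpha * kd + (1.0 - alpha) * internal
```

The single teacher-student layer pairing shown here is only the simplest case; the paper's stacked procedure schedules internal matching across layers during training.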
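
The four GLUE tasks cited under Open Datasets (CoLA, QQP, MRPC, RTE) are publicly available. As one possible way to obtain them, the sketch below uses the Hugging Face `datasets` library; the paper does not state which data tooling was actually used.

```python
# Hedged example: fetching the four GLUE tasks named above with the Hugging
# Face `datasets` library (an assumption; the paper does not specify tooling).
from datasets import load_dataset

tasks = ["cola", "qqp", "mrpc", "rte"]
glue = {task: load_dataset("glue", task) for task in tasks}

for task, splits in glue.items():
    # GLUE test labels are withheld, so model selection is done on the
    # validation (development) split, consistent with the Dataset Splits row.
    print(task, {name: len(ds) for name, ds in splits.items()})
```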
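
The Experiment Setup row pins down the optimizer, learning rate, schedule, epoch counts, batch size, and maximum sequence length. The sketch below wires those reported values together using the `transformers` library; the choice of library, the use of AdamW as the Adam variant, and the warmup and step counts are assumptions made for illustration only.

```python
# Hedged sketch of the reported fine-tuning setup: lr 2e-5, a linear schedule
# in the style of Devlin et al. (2018), batch size 32, max length 64, and
# 10 epochs for BERT-base. Library choice and step counts are assumptions.
import torch
from transformers import (
    BertForSequenceClassification,
    BertTokenizerFast,
    get_linear_schedule_with_warmup,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

epochs, batch_size, max_len = 10, 32, 64
steps_per_epoch = 1000           # placeholder; compute from the actual training-set size
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # AdamW stands in for BERT-style Adam
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,          # warmup size is not reported; 0 is a placeholder
    num_training_steps=epochs * steps_per_epoch,
)

# Single illustrative training step on a toy batch.
batch = tokenizer(
    ["The book was read by the student."],
    padding="max_length", truncation=True, max_length=max_len, return_tensors="pt",
)
outputs = model(**batch, labels=torch.tensor([1]))
outputs.loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```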