Reinforced Multi-Teacher Selection for Knowledge Distillation
Authors: Fei Yuan, Linjun Shou, Jian Pei, Wutao Lin, Ming Gong, Yan Fu, Daxin Jiang
AAAI 2021, pp. 14284–14291 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experimental results on several NLP tasks clearly verify the feasibility and effectiveness of our approach. |
| Researcher Affiliation | Collaboration | (1) University of Electronic Science and Technology of China; (2) Microsoft STCA NLP Group; (3) School of Computing Science, Simon Fraser University |
| Pseudocode | Yes | Algorithm 1: Overall Training Procedure (a hedged sketch of the distillation loss follows the table). |
| Open Source Code | No | The paper does not explicitly state that the source code for its methodology is made available, nor does it provide a direct link to it. |
| Open Datasets | Yes | We evaluate our proposed approach on three different NLP tasks from the GLUE benchmark (Wang et al. 2019), namely Sentiment Classification (SC), Paraphrase Similarity Matching (PSM) and Natural Language Inference (NLI). |
| Dataset Splits | Yes | The statistics of the data sets are shown in Table 2. We use prediction accuracy as the metric in evaluation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like 'Patient KD' and 'BERT-Base' but does not specify their version numbers or other software dependencies with version details. |
| Experiment Setup | Yes | The student models... we set the batch size to 32, the number of epochs to 4, the maximum sequence length to 128, the learning rate to {1e-5, 2e-5, 5e-5}, the distillation temperature T to {5, 10, 20}, and the loss equilibrium coefficient α to {0.2, 0.5, 0.7}. We choose the best model based on the performance on the development set. The γ in the experiments is selected from {0.3, 0.5, 0.7, 0.9} based on development-set performance. (See the grid-search sketch after this table.) |
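
The paper's Algorithm 1 is not reproduced on this page. As a rough illustration only, the sketch below shows a standard soft-label knowledge-distillation objective (Hinton et al. 2015) wired to the two hyperparameters reported above, the temperature T and the loss equilibrium coefficient α. The exact loss in the paper, and its reinforced teacher-selection step, are not shown here; treat this as an assumption-laden PyTorch sketch, not the authors' implementation.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=10.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and a temperature-softened
    KL term against the (selected) teacher. T and alpha match the search
    ranges in the table; the paper's actual objective may differ."""
    # Hard-label cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # KL divergence between temperature-scaled student/teacher distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # conventional T^2 rescaling of the soft-label term
    return alpha * ce + (1.0 - alpha) * kl
```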
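
For concreteness, the reported search space can be written down as a small grid. The sketch below enumerates the combinations and keeps the configuration with the best development-set accuracy, mirroring the model-selection rule quoted above; `train_and_eval` is a hypothetical callback standing in for the actual fine-tuning/distillation run.

```python
from itertools import product

# Search space as reported in the paper; batch size, epoch count, and
# maximum sequence length are fixed rather than searched.
GRID = {
    "learning_rate": [1e-5, 2e-5, 5e-5],
    "temperature_T": [5, 10, 20],
    "alpha": [0.2, 0.5, 0.7],
    "gamma": [0.3, 0.5, 0.7, 0.9],
}
FIXED = {"batch_size": 32, "epochs": 4, "max_seq_len": 128}

def select_best(train_and_eval):
    """train_and_eval(config) -> dev-set accuracy (hypothetical)."""
    best_acc, best_cfg = -1.0, None
    for values in product(*GRID.values()):
        cfg = {**FIXED, **dict(zip(GRID.keys(), values))}
        acc = train_and_eval(cfg)
        if acc > best_acc:
            best_acc, best_cfg = acc, cfg
    return best_cfg, best_acc
```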