A Good Learner can Teach Better: Teacher-Student Collaborative Knowledge Distillation

Authors: Ayan Sengupta, Shantanu Dixit, Md Shad Akhtar, Tanmoy Chakraborty

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Exhaustive experiments on the Super GLUE and GLUE benchmarks demonstrate the efficacy of MPDistil compared to 20 conventional KD and advanced Meta KD baselines, showing significant performance gains for the student model; e.g., a distilled 6-layer BERT model outperforms a 12-layer BERT model on five out of six Super GLUE tasks. |
| Researcher Affiliation | Academia | (1) Indian Institute of Technology Delhi, India; (2) Indraprastha Institute of Information Technology Delhi, India |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labelled as such. |
| Open Source Code | Yes | Source code of MPDistil can be found at https://github.com/notmyname16/MPDistil. |
| Open Datasets | Yes | MPDistil is evaluated on 15 different natural language understanding tasks from the Super GLUE (Wang et al., 2019) and GLUE (Wang et al., 2018) benchmarks. |
| Dataset Splits | Yes | Meta-teacher learning and student curriculum learning require a separate labelled dataset, so the original training set of each Super GLUE and GLUE task is split 9:1 into an updated training set and a quiz set; the original dev and test sets from the benchmarks are used unchanged (see the split sketch below the table). |
| Hardware Specification | Yes | One Tesla V100 and one A100-40 GPU were used to conduct the experiments. |
| Software Dependencies | No | The paper mentions the Adam optimizer but does not provide version numbers for software dependencies or libraries such as Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | For all Super GLUE and GLUE tasks, a maximum sequence length of 128 is used. All reported results (except those taken from other baselines) come from a grid search over the teacher learning rate {2e-5, 3e-5}, the student learning rate {2e-5, 3e-5}, τ {4.0, 5.0, 6.0, 7.0}, α {0.4, 0.5, 0.6}, and β {80, 90, 100}. The discount factor γ is set to 0.99, and the meta-teacher and curriculum models are trained with a fixed learning rate of 0.001. All models are trained for at most 10 epochs, the curriculum model for 200 episodes, and the training, quiz, and validation batch size is 8 (see the grid sketch below the table). |
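For the Dataset Splits row, a minimal sketch of how the reported 9:1 train/quiz split could be reproduced, assuming the Hugging Face `datasets` library is used to load a GLUE task; the paper does not name its data-loading library, and the task name and seed below are illustrative assumptions.

```python
# Illustrative 9:1 split of a benchmark training set into an updated
# training set and a quiz set; dev/test splits are left untouched.
from datasets import load_dataset

raw = load_dataset("glue", "rte")                     # task name is illustrative
split = raw["train"].train_test_split(test_size=0.1,  # 9:1 ratio from the paper
                                      seed=42)        # seed is an assumption
train_set = split["train"]    # 90% -> updated training set
quiz_set = split["test"]      # 10% -> quiz set for meta-teacher / curriculum learning
dev_set = raw["validation"]   # original dev split, used as-is
```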
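For the Experiment Setup row, a minimal sketch that enumerates the quoted hyperparameter grid; only the value ranges and fixed settings come from the paper, while the dictionary keys (e.g., `meta_lr`) are placeholder names.

```python
# Enumerate the hyperparameter grid described in the Experiment Setup row.
from itertools import product

grid = {
    "teacher_lr": [2e-5, 3e-5],
    "student_lr": [2e-5, 3e-5],
    "tau": [4.0, 5.0, 6.0, 7.0],   # distillation temperature
    "alpha": [0.4, 0.5, 0.6],
    "beta": [80, 90, 100],
}
fixed = {
    "gamma": 0.99,           # discount factor
    "meta_lr": 1e-3,         # meta-teacher / curriculum model learning rate
    "max_epochs": 10,
    "episodes": 200,         # curriculum model training episodes
    "batch_size": 8,         # training, quiz and validation
    "max_seq_length": 128,
}

configs = [dict(zip(grid, values), **fixed) for values in product(*grid.values())]
print(len(configs))  # 2 * 2 * 4 * 3 * 3 = 144 configurations per task
```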