On student-teacher deviations in distillation: does it pay to disobey?

Authors: Vaishnavh Nagarajan, Aditya K. Menon, Srinadh Bhojanapalli, Hossein Mobahi, Sanjiv Kumar

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | First, through experiments on image and language data, we identify that these probability deviations correspond to the student systematically exaggerating the confidence levels of the teacher. Next, we theoretically and empirically establish another form of exaggeration in some simple settings: KD exaggerates the implicit bias of gradient descent in converging faster along the top eigendirections of the data. (A sketch of how this confidence exaggeration could be probed is given after the table.)
Researcher Affiliation | Industry | Vaishnavh Nagarajan (Google Research, vaishnavh@google.com); Aditya Krishna Menon (Google Research, adityakmenon@google.com); Srinadh Bhojanapalli (Google Research, bsrinadh@google.com); Hossein Mobahi (Google Research, hmobahi@google.com); Sanjiv Kumar (Google Research, sanjivk@google.com)
Pseudocode | No | The paper does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code for the described methodology, nor a link to a code repository.
Open Datasets | Yes | The tasks considered include image classification benchmarks, namely CIFAR-10, CIFAR-100 [26], Tiny-ImageNet [27], and ImageNet [45], as well as text classification tasks including MNLI [51] from the GLUE benchmark and AGNews [55].
Dataset Splits | No | The paper mentions the use of 'train' and 'test' sets, and tables often include 'Train accuracy' and 'Test accuracy'. However, it does not explicitly state details about a validation dataset split, such as percentages or sample counts.
Hardware Specification | Yes | For all CIFAR experiments in this section, we use GPUs. These experiments take a couple of hours. We run all the other experiments on TPUv3.
Software Dependencies | No | The paper mentions using the TorchVision implementation for some models, but it does not specify version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | For all image datasets, we follow the settings in Table 1 (summary of training settings on image data): weight decay 5e-4; batch size 1024; 450 epochs; peak learning rate 1.0; 15 learning-rate warmup epochs; learning-rate decay factor 0.1; Nesterov momentum 0.9; distillation weight 1.0; distillation temperature 4.0; gradual loss switch window of 1k steps. For all text datasets, we use a batch size of 64 and train for 25,000 steps, with a peak learning rate of 1e-5, 1000 warmup steps, and linear decay. For the distillation experiments on text data, we use a distillation weight of 1.0 and temperature τ = 2.0 for MNLI, τ = 16.0 for IMDB, τ = 1.0 for QQP, and τ = 1.0 for AGNews. (A sketch of the corresponding distillation objective is given below.)
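
The distillation weight and temperature settings above correspond to the standard temperature-scaled knowledge-distillation objective. The following PyTorch sketch shows that objective under the stated hyperparameters; the function name kd_loss, the interpolation between the label and teacher terms, and the demo tensors are illustrative assumptions, since the paper does not release code.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels,
            temperature=4.0, distill_weight=1.0):
    """Label cross-entropy interpolated with a temperature-scaled KL term to the teacher."""
    # Standard one-hot (label) loss.
    ce = F.cross_entropy(student_logits, labels)
    # Soften both distributions with the temperature; the t**2 factor keeps
    # gradient magnitudes comparable across temperatures (Hinton-style KD).
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t ** 2)
    # With distill_weight = 1.0, as in the settings above, the objective is pure
    # distillation; the paper's "gradual loss switch" would ramp this weight over
    # the stated window, which is not modeled here.
    return (1.0 - distill_weight) * ce + distill_weight * kd

if __name__ == "__main__":
    # Tiny smoke test with random logits for a 10-class problem.
    student = torch.randn(8, 10)
    teacher = torch.randn(8, 10)
    labels = torch.randint(0, 10, (8,))
    print(kd_loss(student, teacher, labels).item())
```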
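
The confidence-exaggeration finding quoted in the Research Type row could be probed by comparing teacher and student confidences on the same held-out inputs. The sketch below is a hypothetical probe under that reading, not the paper's measurement protocol; confidence_gap and its arguments (teacher, student, loader) are assumed names for arbitrary trained models and a data loader.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def confidence_gap(teacher, student, loader, device="cpu"):
    """Mean of (student confidence minus teacher confidence), where confidence
    is the max softmax probability assigned to an input."""
    teacher.eval()
    student.eval()
    gaps = []
    for inputs, _ in loader:
        inputs = inputs.to(device)
        teacher_conf = F.softmax(teacher(inputs), dim=-1).max(dim=-1).values
        student_conf = F.softmax(student(inputs), dim=-1).max(dim=-1).values
        gaps.append((student_conf - teacher_conf).cpu())
    # A positive mean gap means the student is, on average, more confident than
    # the teacher on the same inputs, i.e. it exaggerates the teacher's confidence.
    return torch.cat(gaps).mean().item()
```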