On student-teacher deviations in distillation: does it pay to disobey?

Authors: Vaishnavh Nagarajan, Aditya K. Menon, Srinadh Bhojanapalli, Hossein Mobahi, Sanjiv Kumar

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | First, through experiments on image and language data, we identify that these probability deviations correspond to the student systematically exaggerating the confidence levels of the teacher. Next, we theoretically and empirically establish another form of exaggeration in some simple settings: KD exaggerates the implicit bias of gradient descent in converging faster along the top eigendirections of the data. (A sketch of how this confidence exaggeration could be probed is given after the table.)
Researcher Affiliation | Industry | Vaishnavh Nagarajan (Google Research, vaishnavh@google.com); Aditya Krishna Menon (Google Research, adityakmenon@google.com); Srinadh Bhojanapalli (Google Research, bsrinadh@google.com); Hossein Mobahi (Google Research, hmobahi@google.com); Sanjiv Kumar (Google Research, sanjivk@google.com)
Pseudocode | No | The paper does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code for the described methodology, nor a link to a code repository.
Open Datasets | Yes | The tasks considered include image classification benchmarks, namely CIFAR-10, CIFAR-100 [26], Tiny-ImageNet [27], and ImageNet [45], as well as text classification tasks including MNLI [51] from the GLUE benchmark and AGNews [55].
Dataset Splits | No | The paper mentions the use of 'train' and 'test' sets, and tables often include 'Train accuracy' and 'Test accuracy'. However, it does not explicitly state details about a validation dataset split, such as percentages or sample counts.
Hardware Specification | Yes | For all CIFAR experiments in this section, we use GPUs. These experiments take a couple of hours. We run all the other experiments on TPUv3.
Software Dependencies | No | The paper mentions using the TorchVision implementation for some models, but it does not specify version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | For all image datasets, we follow the settings in Table 1 (summary of training settings on image data): weight decay 5e-4; batch size 1024; 450 epochs; peak learning rate 1.0; 15 learning-rate warmup epochs; learning-rate decay factor 0.1; Nesterov momentum 0.9; distillation weight 1.0; distillation temperature 4.0; gradual loss switch window of 1k steps. For all text datasets, we use a batch size of 64 and train for 25,000 steps, with a peak learning rate of 1e-5, 1000 warmup steps, and linear decay. For the distillation experiments on text data, we use a distillation weight of 1.0 and temperature τ = 2.0 for MNLI, τ = 16.0 for IMDB, τ = 1.0 for QQP, and τ = 1.0 for AGNews. (A sketch of the corresponding distillation objective is given below.)
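
The distillation weight and temperature settings above correspond to the standard temperature-scaled knowledge-distillation objective. The following PyTorch sketch shows that objective under the stated hyperparameters; the function name kd_loss, the interpolation between the label and teacher terms, and the demo tensors are illustrative assumptions, since the paper does not release code.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels,
            temperature=4.0, distill_weight=1.0):
    """Label cross-entropy interpolated with a temperature-scaled KL term to the teacher."""
    # Standard one-hot (label) loss.
    ce = F.cross_entropy(student_logits, labels)
    # Soften both distributions with the temperature; the t**2 factor keeps
    # gradient magnitudes comparable across temperatures (Hinton-style KD).
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t ** 2)
    # With distill_weight = 1.0, as in the settings above, the objective is pure
    # distillation; the paper's "gradual loss switch" would ramp this weight over
    # the stated window, which is not modeled here.
    return (1.0 - distill_weight) * ce + distill_weight * kd

if __name__ == "__main__":
    # Tiny smoke test with random logits for a 10-class problem.
    student = torch.randn(8, 10)
    teacher = torch.randn(8, 10)
    labels = torch.randint(0, 10, (8,))
    print(kd_loss(student, teacher, labels).item())
```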
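
The confidence-exaggeration finding quoted in the Research Type row could be probed by comparing teacher and student confidences on the same held-out inputs. The sketch below is a hypothetical probe under that reading, not the paper's measurement protocol; confidence_gap and its arguments (teacher, student, loader) are assumed names for arbitrary trained models and a data loader.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def confidence_gap(teacher, student, loader, device="cpu"):
    """Mean of (student confidence minus teacher confidence), where confidence
    is the max softmax probability assigned to an input."""
    teacher.eval()
    student.eval()
    gaps = []
    for inputs, _ in loader:
        inputs = inputs.to(device)
        teacher_conf = F.softmax(teacher(inputs), dim=-1).max(dim=-1).values
        student_conf = F.softmax(student(inputs), dim=-1).max(dim=-1).values
        gaps.append((student_conf - teacher_conf).cpu())
    # A positive mean gap means the student is, on average, more confident than
    # the teacher on the same inputs, i.e. it exaggerates the teacher's confidence.
    return torch.cat(gaps).mean().item()
```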