On student-teacher deviations in distillation: does it pay to disobey?
Authors: Vaishnavh Nagarajan, Aditya K. Menon, Srinadh Bhojanapalli, Hossein Mobahi, Sanjiv Kumar
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, through experiments on image and language data, we identify that these probability deviations correspond to the student systematically exaggerating the confidence levels of the teacher. Next, we theoretically and empirically establish another form of exaggeration in some simple settings: KD exaggerates the implicit bias of gradient descent in converging faster along the top eigendirections of the data. |
| Researcher Affiliation | Industry | Vaishnavh Nagarajan, Google Research, vaishnavh@google.com; Aditya Krishna Menon, Google Research, adityakmenon@google.com; Srinadh Bhojanapalli, Google Research, bsrinadh@google.com; Hossein Mobahi, Google Research, hmobahi@google.com; Sanjiv Kumar, Google Research, sanjivk@google.com |
| Pseudocode | No | The paper does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | The tasks considered include image classification benchmarks, namely CIFAR-10, CIFAR-100 [26], Tiny-ImageNet [27], ImageNet [45], and text classification tasks from the GLUE benchmark (e.g., MNLI [51], AGNews [55]). |
| Dataset Splits | No | The paper mentions 'train' and 'test' sets, and tables report 'Train accuracy' and 'Test accuracy'. However, it does not explicitly describe a validation split, nor does it give split percentages or sample counts. |
| Hardware Specification | Yes | For all CIFAR experiments in this section we use GPUs. These experiments take a couple of hours. We run all the other experiments on TPUv3. |
| Software Dependencies | No | The paper mentions using 'Torch Vision implementation' for some models, but it does not specify version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | For all image datasets, we follow the settings in Table 1 (summary of training settings on image data): weight decay 5e-4, batch size 1024, 450 epochs, peak learning rate 1.0, 15 learning-rate warmup epochs, learning-rate decay factor 0.1, Nesterov momentum 0.9, distillation weight 1.0, distillation temperature 4.0, and a gradual loss switch window of 1k steps. For all text datasets, we use a batch size of 64 and train for 25,000 steps, with a peak learning rate of 1e-5, 1000 warmup steps, and linear decay. For the distillation experiments on text data, we use a distillation weight of 1.0, with temperature τ = 2.0 for MNLI, τ = 16.0 for IMDB, τ = 1.0 for QQP, and τ = 1.0 for AGNews. A minimal sketch of these distillation settings appears below the table. |
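
Both the image and text settings quoted above specify a distillation weight of 1.0 and a temperature-scaled teacher target. The sketch below shows one standard way such a loss is computed, for orientation only: it is not the authors' code, and the T²-scaled KL form, the weighting rule, and all names (`kd_loss`, `teacher_logits`, and so on) are assumptions based on common knowledge-distillation practice rather than anything stated in the paper.

```python
# Minimal sketch of a temperature-scaled distillation loss matching the
# hyperparameters quoted above. Assumed form, not the authors' implementation.
import torch
import torch.nn.functional as F


def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            labels: torch.Tensor,
            temperature: float = 4.0,      # tau = 4.0 for the image datasets in Table 1
            distill_weight: float = 1.0):  # distillation weight quoted for both settings
    """Weighted sum of one-hot cross-entropy and temperature-scaled KL to the teacher."""
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)

    # KL(teacher || student); the T^2 factor keeps gradient magnitudes comparable
    # to the one-hot loss (standard convention, assumed here).
    distill_term = F.kl_div(student_log_probs, teacher_probs,
                            reduction="batchmean") * temperature ** 2

    # Ordinary cross-entropy against the ground-truth labels.
    one_hot_term = F.cross_entropy(student_logits, labels)

    return distill_weight * distill_term + (1.0 - distill_weight) * one_hot_term
```

With `distill_weight = 1.0`, as quoted for both the image and text experiments, the one-hot term drops out and the objective reduces to the pure distillation loss. The "gradual loss switch window" of 1k steps from Table 1, which presumably transitions between the one-hot and distillation terms early in training, is not modeled in this sketch.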