Supervision Complexity and its Role in Knowledge Distillation
Authors: Hrayr Harutyunyan, Ankit Singh Rawat, Aditya Krishna Menon, Seungyeon Kim, Sanjiv Kumar
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures. ... We provide empirical results on a range of image classification benchmarks confirming the value of online distillation, particularly for students with weak inductive biases. |
| Researcher Affiliation | Collaboration | Hrayr Harutyunyan (1), Ankit Singh Rawat (2), Aditya Krishna Menon (2), Seungyeon Kim (2), Sanjiv Kumar (2); affiliations: (1) USC Information Sciences Institute, (2) Google Research NYC; emails: hrayrhar@usc.edu, {ankitsrawat,adityakmenon,seungyeonk,sanjivk}@google.com |
| Pseudocode | Yes | Algorithm 1: Online knowledge distillation. (A minimal training-loop sketch follows the table.) |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | We consider standard image classification benchmarks: CIFAR-10, CIFAR-100, and Tiny ImageNet. |
| Dataset Splits | No | The paper refers to standard benchmark datasets (CIFAR-10, CIFAR-100, Tiny ImageNet), which come with predefined splits, but it does not explicitly state the training, validation, and test splits (e.g., percentages or sample counts) used for the experiments. It mentions '212 test examples' for one specific analysis, not as a general validation split. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., specific GPU or CPU models, memory, or cloud instances). |
| Software Dependencies | No | The paper mentions using a 'stochastic gradient descent optimizer' but does not provide specific software names with version numbers for libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages. |
| Experiment Setup | Yes | In all experiments we use the stochastic gradient descent optimizer with a batch size of 128 and Nesterov momentum of 0.9. The starting learning rates are presented in Table 4. All models for the CIFAR datasets are trained for 256 epochs, with a schedule that divides the learning rate by 10 at epochs 96, 192, and 224. All models for Tiny ImageNet are trained for 200 epochs, with a schedule that divides the learning rate by 10 at epochs 75 and 135. The learning rate is warmed up linearly to its initial value over the first 10 epochs for CIFAR and the first 5 epochs for Tiny ImageNet. All VGG and ResNet models use a weight decay of 2e-4, while MobileNet models use 1e-5. (A configuration sketch follows the table.) |
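
For concreteness, below is a minimal sketch of an online knowledge distillation step of the kind the paper's Algorithm 1 describes, in which the teacher and the student are updated together so the student always distills from the teacher's current, partially trained state. The paper does not name a framework (see the Software Dependencies row), so PyTorch is an assumption here, as are the temperature, the loss weighting `alpha`, and the function names; this is not a verbatim reproduction of Algorithm 1.

```python
# Minimal sketch of one epoch of online knowledge distillation.
# PyTorch, the temperature, and the loss weighting are assumptions,
# not details taken from the paper.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend cross-entropy on labels with a KL term against the
    (current) teacher's temperature-softened predictions."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kd


def online_distillation_epoch(teacher, student, loader,
                              teacher_opt, student_opt, device="cuda"):
    """One epoch in which teacher and student are trained together,
    so the student distills from the teacher's evolving checkpoints."""
    teacher.train()
    student.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)

        # Teacher step: ordinary supervised training.
        teacher_opt.zero_grad()
        t_logits = teacher(images)
        F.cross_entropy(t_logits, labels).backward()
        teacher_opt.step()

        # Student step: distill from the in-training teacher's outputs.
        student_opt.zero_grad()
        s_logits = student(images)
        loss = distillation_loss(s_logits, t_logits.detach(), labels)
        loss.backward()
        student_opt.step()
```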
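
The optimizer settings quoted in the Experiment Setup row translate directly into a configuration sketch. The batch size of 128, Nesterov momentum of 0.9, step decay at epochs 96/192/224, 10-epoch linear warmup, and 2e-4 weight decay (VGG/ResNet) come from the quote above; PyTorch, the `base_lr` default of 0.1, and the helper name `make_cifar_optimizer` are assumptions, since the paper lists its starting learning rates in its Table 4.

```python
# Sketch of the quoted CIFAR training configuration. The framework (PyTorch),
# the base_lr default, and the helper name are assumptions.
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader


def make_cifar_optimizer(model, train_set, base_lr=0.1, weight_decay=2e-4):
    loader = DataLoader(train_set, batch_size=128, shuffle=True)
    opt = SGD(model.parameters(), lr=base_lr, momentum=0.9,
              nesterov=True, weight_decay=weight_decay)

    def lr_lambda(epoch):
        warmup = min((epoch + 1) / 10.0, 1.0)                   # 10-epoch linear warmup
        decay = 0.1 ** sum(epoch >= m for m in (96, 192, 224))  # divide LR by 10
        return warmup * decay

    sched = LambdaLR(opt, lr_lambda)  # call sched.step() once per epoch, for 256 epochs
    return loader, opt, sched
```

Per the same quote, the Tiny ImageNet variant would instead train for 200 epochs with milestones at epochs 75 and 135 and a 5-epoch warmup, and MobileNet models would use a weight decay of 1e-5.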