Supervision Complexity and its Role in Knowledge Distillation
Authors: Hrayr Harutyunyan, Ankit Singh Rawat, Aditya Krishna Menon, Seungyeon Kim, Sanjiv Kumar
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures. ... We provide empirical results on a range of image classification benchmarks confirming the value of online distillation, particularly for students with weak inductive biases. |
| Researcher Affiliation | Collaboration | Hrayr Harutyunyan (1), Ankit Singh Rawat (2), Aditya Krishna Menon (2), Seungyeon Kim (2), Sanjiv Kumar (2); affiliations: (1) USC Information Sciences Institute, (2) Google Research NYC; emails: hrayrhar@usc.edu, {ankitsrawat,adityakmenon,seungyeonk,sanjivk}@google.com |
| Pseudocode | Yes | Algorithm 1: Online knowledge distillation. (A minimal training-loop sketch follows the table.) |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | We consider standard image classification benchmarks: CIFAR-10, CIFAR-100, and Tiny ImageNet. |
| Dataset Splits | No | The paper refers to standard benchmark datasets (CIFAR-10, CIFAR-100, Tiny ImageNet), which come with predefined splits, but it does not explicitly state the training, validation, and test splits (e.g., percentages or sample counts) used for the experiments. It mentions '212 test examples' for one specific analysis, not as a general validation split. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., specific GPU or CPU models, memory, or cloud instances). |
| Software Dependencies | No | The paper mentions using a 'stochastic gradient descent optimizer' but does not provide specific software names with version numbers for libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages. |
| Experiment Setup | Yes | In all experiments we use the stochastic gradient descent optimizer with a batch size of 128 and Nesterov momentum of 0.9. The starting learning rates are presented in Table 4. All models for the CIFAR datasets are trained for 256 epochs, with a schedule that divides the learning rate by 10 at epochs 96, 192, and 224. All models for Tiny ImageNet are trained for 200 epochs, with a schedule that divides the learning rate by 10 at epochs 75 and 135. The learning rate is warmed up linearly to its initial value over the first 10 epochs for CIFAR and the first 5 epochs for Tiny ImageNet. All VGG and ResNet models use a weight decay of 2e-4, while MobileNet models use 1e-5. (A configuration sketch follows the table.) |
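
For concreteness, below is a minimal sketch of an online knowledge distillation step of the kind the paper's Algorithm 1 describes, in which the teacher and the student are updated together so the student always distills from the teacher's current, partially trained state. The paper does not name a framework (see the Software Dependencies row), so PyTorch is an assumption here, as are the temperature, the loss weighting `alpha`, and the function names; this is not a verbatim reproduction of Algorithm 1.

```python
# Minimal sketch of one epoch of online knowledge distillation.
# PyTorch, the temperature, and the loss weighting are assumptions,
# not details taken from the paper.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend cross-entropy on labels with a KL term against the
    (current) teacher's temperature-softened predictions."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kd


def online_distillation_epoch(teacher, student, loader,
                              teacher_opt, student_opt, device="cuda"):
    """One epoch in which teacher and student are trained together,
    so the student distills from the teacher's evolving checkpoints."""
    teacher.train()
    student.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)

        # Teacher step: ordinary supervised training.
        teacher_opt.zero_grad()
        t_logits = teacher(images)
        F.cross_entropy(t_logits, labels).backward()
        teacher_opt.step()

        # Student step: distill from the in-training teacher's outputs.
        student_opt.zero_grad()
        s_logits = student(images)
        loss = distillation_loss(s_logits, t_logits.detach(), labels)
        loss.backward()
        student_opt.step()
```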
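
The optimizer settings quoted in the Experiment Setup row translate directly into a configuration sketch. The batch size of 128, Nesterov momentum of 0.9, step decay at epochs 96/192/224, 10-epoch linear warmup, and 2e-4 weight decay (VGG/ResNet) come from the quote above; PyTorch, the `base_lr` default of 0.1, and the helper name `make_cifar_optimizer` are assumptions, since the paper lists its starting learning rates in its Table 4.

```python
# Sketch of the quoted CIFAR training configuration. The framework (PyTorch),
# the base_lr default, and the helper name are assumptions.
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader


def make_cifar_optimizer(model, train_set, base_lr=0.1, weight_decay=2e-4):
    loader = DataLoader(train_set, batch_size=128, shuffle=True)
    opt = SGD(model.parameters(), lr=base_lr, momentum=0.9,
              nesterov=True, weight_decay=weight_decay)

    def lr_lambda(epoch):
        warmup = min((epoch + 1) / 10.0, 1.0)                   # 10-epoch linear warmup
        decay = 0.1 ** sum(epoch >= m for m in (96, 192, 224))  # divide LR by 10
        return warmup * decay

    sched = LambdaLR(opt, lr_lambda)  # call sched.step() once per epoch, for 256 epochs
    return loader, opt, sched
```

Per the same quote, the Tiny ImageNet variant would instead train for 200 epochs with milestones at epochs 75 and 135 and a 5-epoch warmup, and MobileNet models would use a weight decay of 1e-5.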