Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy

Authors: Asit Mishra, Debbie Marr

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our approach, Apprentice, achieves state-of-the-art accuracies using ternary precision and 4-bit precision for variants of ResNet architecture on ImageNet dataset. We present three schemes using which one can apply knowledge distillation techniques to various stages of the train-and-deploy pipeline.
Researcher Affiliation | Industry | Asit Mishra & Debbie Marr, Accelerator Architecture Lab, Intel Labs, {asit.k.mishra,debbie.marr}@intel.com
Pseudocode | No | The paper describes its methods in prose and uses diagrams but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper references third-party tools like 'TensorFlow' and a 'Torch implementation of ResNet' with their URLs, but does not provide a link or statement about open-sourcing the code for its own 'Apprentice' methodology.
Open Datasets | Yes | Our approach, Apprentice, achieves state-of-the-art accuracies using ternary precision and 4-bit precision for variants of ResNet architecture on ImageNet dataset. On Imagenet-1K (Russakovsky et al., 2015), TTQ achieves 33.4% Top-1 error rate with a ResNet-18 model. In addition to ImageNet dataset, we also experiment with Apprentice scheme on CIFAR-10 dataset. CIFAR-10 dataset (Krizhevsky, 2009) consists of 50K training images and 10K testing images in 10 classes.
Dataset Splits | Yes | Table 1: Top-1 validation set error rate (%) on ImageNet-1K for ResNet-18 student network as precision of activations (A) and weight (W) changes.
Hardware Specification | No | The paper mentions training on 'CPU and/or GPU clusters' but does not provide any specific details about the hardware used, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions 'TensorFlow (Abadi et al., 2015)' and a 'Torch implementation of ResNet' but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | We use a batch size of 256 and no hyper-parameters are changed from what is mentioned in the recipe. For the third term in equation 1, we experimented with a mean-squared error loss function and also a loss function with logits from both the student and the teacher network (i.e. H(z_T, z_A)). In general, we find training with a learning rate of 1e-3 for 10 to 15 epochs, followed by 1e-4 for another 5 to 10 epochs, followed by 1e-5 for another 5 epochs to give us the best accuracy. Some configurations run for about 40 to 50 epochs before stabilizing. We set α = 1, β = 0.5 and γ = 0.5.
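To make the quoted setup concrete, below is a minimal sketch of the three-term joint loss (equation 1 in the paper) with the stated weights α = 1, β = 0.5, γ = 0.5 and the two variants of the third term (mean-squared error on logits, or a cross-entropy-style H(z_T, z_A) term). The paper publishes no code and mentions TensorFlow and a Torch ResNet implementation; this PyTorch-style Python, the function name apprentice_loss, the use_mse flag, and the exact form of each term are assumptions for illustration, not the authors' implementation.

```python
import torch.nn.functional as F

# Weights for the three terms of the joint loss, as quoted in the experiment setup.
ALPHA, BETA, GAMMA = 1.0, 0.5, 0.5

def apprentice_loss(teacher_logits, student_logits, labels, use_mse=False):
    """Sketch of a three-term teacher-student loss (hypothetical helper).

    teacher_logits: z_T from the full-precision teacher network
    student_logits: z_A from the low-precision apprentice (student) network
    labels:         ground-truth class indices
    use_mse:        if True, use the mean-squared-error variant of the third term
    """
    # Term 1 (weighted by alpha): teacher cross-entropy against the true labels.
    teacher_ce = F.cross_entropy(teacher_logits, labels)
    # Term 2 (weighted by beta): student cross-entropy against the true labels.
    student_ce = F.cross_entropy(student_logits, labels)
    # Term 3 (weighted by gamma): distillation term between teacher and student.
    if use_mse:
        # Mean-squared-error variant on the raw logits.
        distill = F.mse_loss(student_logits, teacher_logits)
    else:
        # Cross-entropy of the student's predictions against the teacher's
        # soft targets, i.e. the H(z_T, z_A) variant mentioned in the setup.
        soft_targets = F.softmax(teacher_logits, dim=1)
        distill = -(soft_targets * F.log_softmax(student_logits, dim=1)).sum(dim=1).mean()
    return ALPHA * teacher_ce + BETA * student_ce + GAMMA * distill
```

Pairing a loss of this shape with the quoted schedule (1e-3 for 10 to 15 epochs, then 1e-4 for another 5 to 10, then 1e-5 for another 5, at batch size 256) would reproduce the described training setup.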