Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy

Authors: Asit Mishra, Debbie Marr

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our approach, Apprentice, achieves state-of-the-art accuracies using ternary precision and 4-bit precision for variants of ResNet architecture on ImageNet dataset. We present three schemes using which one can apply knowledge distillation techniques to various stages of the train-and-deploy pipeline.
Researcher Affiliation | Industry | Asit Mishra & Debbie Marr, Accelerator Architecture Lab, Intel Labs, {asit.k.mishra,debbie.marr}@intel.com
Pseudocode | No | The paper describes its methods in prose and uses diagrams but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper references third-party tools like 'TensorFlow' and a 'Torch implementation of ResNet' with their URLs, but does not provide a link or statement about open-sourcing the code for its own 'Apprentice' methodology.
Open Datasets | Yes | Our approach, Apprentice, achieves state-of-the-art accuracies using ternary precision and 4-bit precision for variants of ResNet architecture on ImageNet dataset. On Imagenet-1K (Russakovsky et al., 2015), TTQ achieves 33.4% Top-1 error rate with a ResNet-18 model. In addition to ImageNet dataset, we also experiment with Apprentice scheme on CIFAR-10 dataset. CIFAR-10 dataset (Krizhevsky, 2009) consists of 50K training images and 10K testing images in 10 classes.
Dataset Splits | Yes | Table 1: Top-1 validation set error rate (%) on ImageNet-1K for ResNet-18 student network as precision of activations (A) and weight (W) changes.
Hardware Specification | No | The paper mentions training on 'CPU and/or GPU clusters' but does not provide any specific details about the hardware used, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions 'TensorFlow (Abadi et al., 2015)' and a 'Torch implementation of ResNet' but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | We use a batch size of 256 and no hyper-parameters are changed from what is mentioned in the recipe. For the third term in equation 1, we experimented with a mean-squared error loss function and also a loss function with logits from both the student and the teacher network (i.e. H(z_T, z_A)). In general, we find training with a learning rate of 1e-3 for 10 to 15 epochs, followed by 1e-4 for another 5 to 10 epochs, followed by 1e-5 for another 5 epochs to give us the best accuracy. Some configurations run for about 40 to 50 epochs before stabilizing. We set α = 1, β = 0.5 and γ = 0.5.
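To make the quoted setup concrete, below is a minimal sketch of the three-term joint loss (equation 1 in the paper) with the stated weights α = 1, β = 0.5, γ = 0.5 and the two variants of the third term (mean-squared error on logits, or a cross-entropy-style H(z_T, z_A) term). The paper publishes no code and mentions TensorFlow and a Torch ResNet implementation; this PyTorch-style Python, the function name apprentice_loss, the use_mse flag, and the exact form of each term are assumptions for illustration, not the authors' implementation.

```python
import torch.nn.functional as F

# Weights for the three terms of the joint loss, as quoted in the experiment setup.
ALPHA, BETA, GAMMA = 1.0, 0.5, 0.5

def apprentice_loss(teacher_logits, student_logits, labels, use_mse=False):
    """Sketch of a three-term teacher-student loss (hypothetical helper).

    teacher_logits: z_T from the full-precision teacher network
    student_logits: z_A from the low-precision apprentice (student) network
    labels:         ground-truth class indices
    use_mse:        if True, use the mean-squared-error variant of the third term
    """
    # Term 1 (weighted by alpha): teacher cross-entropy against the true labels.
    teacher_ce = F.cross_entropy(teacher_logits, labels)
    # Term 2 (weighted by beta): student cross-entropy against the true labels.
    student_ce = F.cross_entropy(student_logits, labels)
    # Term 3 (weighted by gamma): distillation term between teacher and student.
    if use_mse:
        # Mean-squared-error variant on the raw logits.
        distill = F.mse_loss(student_logits, teacher_logits)
    else:
        # Cross-entropy of the student's predictions against the teacher's
        # soft targets, i.e. the H(z_T, z_A) variant mentioned in the setup.
        soft_targets = F.softmax(teacher_logits, dim=1)
        distill = -(soft_targets * F.log_softmax(student_logits, dim=1)).sum(dim=1).mean()
    return ALPHA * teacher_ce + BETA * student_ce + GAMMA * distill
```

Pairing a loss of this shape with the quoted schedule (1e-3 for 10 to 15 epochs, then 1e-4 for another 5 to 10, then 1e-5 for another 5, at batch size 256) would reproduce the described training setup.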