Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy
Authors: Asit Mishra, Debbie Marr
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach, Apprentice, achieves state-of-the-art accuracies using ternary precision and 4-bit precision for variants of ResNet architecture on ImageNet dataset. We present three schemes using which one can apply knowledge distillation techniques to various stages of the train-and-deploy pipeline. |
| Researcher Affiliation | Industry | Asit Mishra & Debbie Marr, Accelerator Architecture Lab, Intel Labs, {asit.k.mishra,debbie.marr}@intel.com |
| Pseudocode | No | The paper describes its methods in prose and uses diagrams but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper references third-party tools like 'TensorFlow' and 'Torch implementation of ResNet' with their URLs, but does not provide a link or statement about open-sourcing the code for their own 'Apprentice' methodology. |
| Open Datasets | Yes | Our approach, Apprentice, achieves state-of-the-art accuracies using ternary precision and 4-bit precision for variants of ResNet architecture on ImageNet dataset. On ImageNet-1K (Russakovsky et al., 2015), TTQ achieves 33.4% Top-1 error rate with a ResNet-18 model. In addition to ImageNet dataset, we also experiment with Apprentice scheme on CIFAR-10 dataset. CIFAR-10 dataset (Krizhevsky, 2009) consists of 50K training images and 10K testing images in 10 classes. |
| Dataset Splits | Yes | Table 1: Top-1 validation set error rate (%) on ImageNet-1K for ResNet-18 student network as precision of activations (A) and weights (W) changes. |
| Hardware Specification | No | The paper mentions training on 'CPU and/or GPU clusters' but does not provide any specific details about the hardware used, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions 'TensorFlow (Abadi et al., 2015)' and 'Torch implementation of ResNet' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We use a batch size of 256 and no hyper-parameters are changed from what is mentioned in the recipe. For the third term in equation 1, we experimented with a mean-squared error loss function and also a loss function with logits from both the student and the teacher network (i.e. H(z^T, z^A)). In general, we find training with a learning rate of 1e-3 for 10 to 15 epochs, followed by 1e-4 for another 5 to 10 epochs, followed by 1e-5 for another 5 epochs to give us the best accuracy. Some configurations run for about 40 to 50 epochs before stabilizing. We set α = 1, β = 0.5 and γ = 0.5. (A sketch of a loss of this form is given below the table.) |
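
The quoted setup refers to a three-term loss (equation 1 in the paper) that combines hard-label cross-entropy for the full-precision teacher and the low-precision student with a distillation term against the teacher's output. Below is a minimal PyTorch-style sketch of such a loss, using the α = 1, β = 0.5, γ = 0.5 weights quoted above. The function name `apprentice_loss`, the temperature of 1, and detaching the teacher's soft targets so the distillation term only updates the student are assumptions for illustration, not details confirmed by this report.

```python
import torch
import torch.nn.functional as F

def apprentice_loss(teacher_logits, student_logits, labels,
                    alpha=1.0, beta=0.5, gamma=0.5):
    """Sketch of a three-term teacher/student distillation loss.

    Assumptions (not confirmed by the report): temperature of 1 and
    detached teacher soft targets, so the gamma term updates only the
    student network.
    """
    # Hard-label cross-entropy for the teacher (alpha term).
    teacher_ce = F.cross_entropy(teacher_logits, labels)
    # Hard-label cross-entropy for the low-precision student (beta term).
    student_ce = F.cross_entropy(student_logits, labels)
    # Distillation term: cross-entropy of the student's prediction against
    # the teacher's soft output, in the spirit of H(z^T, p^A) (gamma term).
    soft_targets = F.softmax(teacher_logits, dim=1).detach()
    distill = -(soft_targets * F.log_softmax(student_logits, dim=1)).sum(dim=1).mean()
    return alpha * teacher_ce + beta * student_ce + gamma * distill

# Hypothetical usage with a batch of 8 and 1000 ImageNet classes.
t = torch.randn(8, 1000)              # full-precision teacher logits
s = torch.randn(8, 1000)              # low-precision student logits
y = torch.randint(0, 1000, (8,))      # ground-truth labels
loss = apprentice_loss(t, s, y)
```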