Model compression via distillation and quantization
Authors: Antonio Polino, Razvan Pascanu, Dan Alistarh
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate both methods through experiments on convolutional and recurrent architectures. We show that quantized shallow students can reach similar accuracy levels to state-of-the-art full-precision teacher models, while providing up to order of magnitude compression, and inference speedup that is almost linear in the depth reduction. In sum, our results enable DNNs for resource-constrained environments to leverage architecture and accuracy advances developed on more powerful devices. |
| Researcher Affiliation | Collaboration | Antonio Polino, ETH Zürich (antonio.polino1@gmail.com); Razvan Pascanu, Google DeepMind (razp@google.com); Dan Alistarh, IST Austria (dan.alistarh@ist.ac.at) |
| Pseudocode | Yes | Algorithm 1 Quantized Distillation, Algorithm 2 Differentiable Quantization (a rough sketch of the quantized distillation step appears below the table) |
| Open Source Code | Yes | Source code available at https://github.com/antspy/quantized_distillation |
| Open Datasets | Yes | The OpenNMT integration test dataset (Ope) consists of 200K train sentences and 10K test sentences for a German-English translation task. To train and test models we use the OpenNMT PyTorch codebase (Klein et al., 2017). Our target models consist of an embedding layer, an encoder consisting of n layers of LSTM, a decoder consisting of n layers of LSTM, and a linear layer. |
| Dataset Splits | Yes | The OpenNMT integration test dataset (Ope) consists of 200K train sentences and 10K test sentences for a German-English translation task. |
| Hardware Specification | No | On the ImageNet test set using 4 GPUs (data-parallel), a forward pass takes 263 seconds for ResNet34, 169 seconds for ResNet18, and 169 seconds for our 2xResNet18. (So, while having more parameters than ResNet18, it has the same speed because it has the same number of layers, and is not wide enough to saturate the GPU. We note that we did not exploit 4-bit weights, due to the lack of hardware support.) |
| Software Dependencies | No | To train and test models we use the OpenNMT PyTorch codebase (Klein et al., 2017). |
| Experiment Setup | Yes | Distillation loss is computed with a temperature of T = 5 (CIFAR-10 experiments). We train for 200 epochs with an initial learning rate of 0.1 (CIFAR-100 experiments). For OpenNMT, the learning rate starts at 1 and is halved every epoch starting from the first epoch where perplexity doesn't drop on the test set. We train every model for 15 epochs. Distillation loss is computed with a temperature of T = 1. (The OpenNMT learning-rate rule is sketched below the table.) |
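
The two procedures listed under Pseudocode are specified in full in the paper and in the linked repository. As a rough illustration of the first, the PyTorch-style sketch below shows the general shape of a quantized distillation step: the distillation loss is computed with the student's weights quantized, and the resulting gradient is applied back to the full-precision weights. The helper names (`uniform_quantize`, `distillation_loss`, `quantized_distillation_step`) and the simple min-max quantizer are illustrative assumptions, not the authors' API; the paper additionally covers bucketing, stochastic rounding, and per-layer bit widths.

```python
import torch
import torch.nn.functional as F

def uniform_quantize(w, bits=4):
    """Illustrative min-max uniform quantization of a weight tensor to 2**bits levels.
    (Bucketing and stochastic rounding, discussed in the paper, are omitted here.)"""
    levels = 2 ** bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / levels
    if scale == 0:
        return w.clone()
    return torch.round((w - w_min) / scale) * scale + w_min

def distillation_loss(student_logits, teacher_logits, targets, T=5.0, alpha=0.5):
    """Standard distillation objective: temperature-softened KL term plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard

def quantized_distillation_step(student, teacher, optimizer, x, y, bits=4, T=5.0):
    """One training step in the spirit of Algorithm 1 (Quantized Distillation):
    forward/backward with quantized student weights, update the full-precision weights."""
    full_precision = [p.detach().clone() for p in student.parameters()]
    with torch.no_grad():
        for p in student.parameters():
            p.copy_(uniform_quantize(p, bits))      # quantize in place for this step
    optimizer.zero_grad()
    with torch.no_grad():
        teacher_logits = teacher(x)                 # teacher is frozen
    loss = distillation_loss(student(x), teacher_logits, y, T=T)
    loss.backward()                                 # gradient evaluated at the quantized weights
    with torch.no_grad():
        for p, fp in zip(student.parameters(), full_precision):
            p.copy_(fp)                             # restore full-precision weights
    optimizer.step()                                # apply the gradient to them
    return loss.item()
```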
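
The OpenNMT learning-rate schedule quoted under Experiment Setup (start at 1, halve every epoch from the first epoch where test perplexity does not drop) can be written as a small helper. The class below is only a paraphrase of that quoted rule with hypothetical names, not the authors' training code.

```python
class HalvingSchedule:
    """Sketch of the quoted OpenNMT schedule: the learning rate starts at 1 and is
    halved every epoch, starting from the first epoch where test perplexity does not drop."""

    def __init__(self, initial_lr=1.0):
        self.lr = initial_lr
        self.best_ppl = float("inf")
        self.decaying = False

    def step(self, test_ppl):
        # Once perplexity fails to improve, keep halving on every subsequent epoch.
        if self.decaying or test_ppl >= self.best_ppl:
            self.decaying = True
            self.lr /= 2.0
        self.best_ppl = min(self.best_ppl, test_ppl)
        return self.lr
```

Usage would be: create `HalvingSchedule()`, and after each of the 15 training epochs evaluate perplexity on the test set and call `step(ppl)` to obtain the learning rate for the next epoch.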