Model compression via distillation and quantization

Authors: Antonio Polino, Razvan Pascanu, Dan Alistarh

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate both methods through experiments on convolutional and recurrent architectures. We show that quantized shallow students can reach similar accuracy levels to state-of-the-art full-precision teacher models, while providing up to order of magnitude compression, and inference speedup that is almost linear in the depth reduction. In sum, our results enable DNNs for resource-constrained environments to leverage architecture and accuracy advances developed on more powerful devices.
Researcher Affiliation | Collaboration | Antonio Polino (ETH Zürich, antonio.polino1@gmail.com); Razvan Pascanu (Google DeepMind, razp@google.com); Dan Alistarh (IST Austria, dan.alistarh@ist.ac.at)
Pseudocode | Yes | Algorithm 1: Quantized Distillation; Algorithm 2: Differentiable Quantization (a rough sketch of the quantized-distillation step appears below the table)
Open Source Code | Yes | Source code available at https://github.com/antspy/quantized_distillation
Open Datasets | Yes | The OpenNMT integration test dataset (Ope) consists of 200K train sentences and 10K test sentences for a German-English translation task. To train and test models we use the OpenNMT PyTorch codebase (Klein et al., 2017). Our target models consist of an embedding layer, an encoder consisting of n layers of LSTM, a decoder consisting of n layers of LSTM, and a linear layer.
Dataset Splits | Yes | The OpenNMT integration test dataset (Ope) consists of 200K train sentences and 10K test sentences for a German-English translation task.
Hardware Specification | No | On the ImageNet test set using 4 GPUs (data-parallel), a forward pass takes 263 seconds for ResNet34, 169 seconds for ResNet18, and 169 seconds for our 2x ResNet18. (So, while having more parameters than ResNet18, it has the same speed because it has the same number of layers, and is not wide enough to saturate the GPU. We note that we did not exploit 4-bit weights, due to the lack of hardware support.)
Software Dependencies | No | To train and test models we use the OpenNMT PyTorch codebase (Klein et al., 2017).
Experiment Setup | Yes | Distillation loss is computed with a temperature of T = 5 (for the CIFAR-10 experiments). We train for 200 epochs with an initial learning rate of 0.1 (for the CIFAR-100 experiments). For OpenNMT, the learning rate starts at 1 and is halved every epoch starting from the first epoch where perplexity doesn't drop on the test set. We train every model for 15 epochs. Distillation loss is computed with a temperature of T = 1.
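
To make the Pseudocode and Experiment Setup rows concrete, the sketch below (Python/PyTorch, matching the codebase linked above) shows a temperature-scaled distillation loss and one training step in the spirit of Algorithm 1 (Quantized Distillation). It is a minimal illustration under stated assumptions, not the authors' released implementation: uniform_quantize, the mixing weight alpha, and the step structure are simplifications, and the paper additionally covers bucketing, stochastic rounding, and learned quantization points (Algorithm 2).

import torch
import torch.nn.functional as F


def uniform_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Round a weight tensor onto 2**bits evenly spaced levels in [min, max].

    Simplified stand-in for the paper's quantization function; for brevity it
    is applied to every parameter tensor, including biases.
    """
    lo, hi = w.min(), w.max()
    scale = ((hi - lo) / (2 ** bits - 1)).clamp(min=1e-8)
    return torch.round((w - lo) / scale) * scale + lo


def distillation_loss(student_logits, teacher_logits, targets, T=5.0, alpha=0.5):
    """Temperature-scaled distillation loss plus cross-entropy on hard labels.

    T = 5 matches the CIFAR-10 setting quoted above (T = 1 for OpenNMT);
    alpha is a hypothetical mixing weight, not taken from the paper.
    """
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1.0 - alpha) * ce


def quantized_distillation_step(student, teacher, x, y, optimizer, bits=4, T=5.0):
    """One step in the spirit of Algorithm 1 (Quantized Distillation):
    compute the distillation gradient at the quantized weights, then apply it
    to the full-precision weights (straight-through / projected gradient).
    """
    full_precision = [p.detach().clone() for p in student.parameters()]
    with torch.no_grad():
        for p in student.parameters():
            p.copy_(uniform_quantize(p, bits))
        teacher_logits = teacher(x)

    loss = distillation_loss(student(x), teacher_logits, y, T=T)
    optimizer.zero_grad()
    loss.backward()

    # Restore the full-precision weights before stepping, so the gradient
    # (taken at the quantized point) updates the full-precision copy.
    with torch.no_grad():
        for p, fp in zip(student.parameters(), full_precision):
            p.copy_(fp)
    optimizer.step()
    return loss.item()

In this sketch the optimizer updates the full-precision copy, which is re-quantized at the start of the next step; at deployment only the quantized weights would be kept, which is where the compression reported in the paper comes from.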