Training with Quantization Noise for Extreme Model Compression

Authors: Pierre Stock, Angela Fan, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou, Armand Joulin

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training (Jacob et al., 2018), where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator (Bengio et al., 2013). In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods where the approximations introduced by STE are severe, such as Product Quantization. Our proposal is to only quantize a different random subset of weights during each forward, allowing for unbiased gradients to flow through the other weights. Controlling the amount of noise and its form allows for extreme compression rates while maintaining the performance of the original model. As a result we establish new state-of-the-art compromises between accuracy and model size both in natural language processing and image classification. For example, applying our method to state-of-the-art Transformer and ConvNet architectures, we can achieve 82.5% accuracy on MNLI by compressing RoBERTa to 14 MB and 80.0% top-1 accuracy on ImageNet by compressing an EfficientNet-B3 to 3.3 MB.
Researcher Affiliation | Collaboration | Pierre Stock (Facebook AI Research, Inria); Angela Fan (Facebook AI Research, LORIA); Benjamin Graham (Facebook AI Research); Edouard Grave (Facebook AI Research); Rémi Gribonval (Inria); Hervé Jégou (Facebook AI Research); Armand Joulin (Facebook AI Research)
Pseudocode | No | The paper describes the proposed method in detail in Section 4 'QUANT-NOISE FOR TRAINING COMPACT MODELS' but does not include formal pseudocode or an algorithm block.
Open Source Code | Yes | Code available at https://github.com/pytorch/fairseq/tree/master/examples/quant_noise
Open Datasets | Yes | We experiment on the Wikitext-103 benchmark (Merity et al., 2016) that contains 100M tokens and a vocabulary of 260k words. We pre-train the base BERT model (Devlin et al., 2018) on the BooksCorpus + Wiki dataset with a LayerDrop rate of 0.2. We finetune the pre-trained models on the MNLI task (Williams et al., 2018) from the GLUE Benchmark (Wang et al., 2019) and report accuracy. We train an EfficientNet-B3 model (Tan & Le, 2019) on the ImageNet object classification benchmark (Deng et al., 2009).
Dataset Splits | Yes | For image classification, we train an EfficientNet-B3 on the ImageNet-1k benchmark, report top-1 accuracy on the validation set, and use our re-implementation of EfficientNet-B3.
Hardware Specification | No | The paper mentions 'accelerating inference on supporting hardware' and 'on dedicated hardware' but does not specify any particular CPU, GPU, or other hardware used for running the experiments.
Software Dependencies | No | Our models are implemented in PyTorch (Paszke et al., 2017). We use fairseq (Ott et al., 2019) for language modeling and pre-training for sentence representation tasks, and Classy Vision (Adcock et al., 2019) for EfficientNet. While software packages are named, specific version numbers are not provided for PyTorch, fairseq, or Classy Vision.
Experiment Setup | Yes | For both settings, we report model size in megabytes (MB) and the compression ratio compared to the original model. Our quantization noise framework is general and flexible. Quant-Noise improves the performance of quantized models for every quantization scheme in both experimental settings. Importantly, Quant-Noise only changes model training by adding a regularization noise similar to dropout, with no impact on convergence and very limited impact on training speed (< 5% slower). We use a cosine learning rate schedule (Baevski & Auli, 2018; Loshchilov & Hutter, 2016) and train with Nesterov's accelerated gradient (Sutskever et al., 2013). We set the momentum to 0.99 and renormalize gradients if the norm exceeds 0.1 (Pascanu et al., 2014). We set LayerDrop to 0.2. We set the Quant-Noise value to 0.05. During training, we searched over the values (0.05, 0.1, 0.2) to determine the optimal Quant-Noise rate. During training, the block size of Quant-Noise is 8.
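
To make the method described under 'Research Type' above more concrete, here is a minimal sketch of the quantization-noise idea: on each forward pass, only a random subset of weight blocks is quantized (through a straight-through estimator), so gradients flow unbiased through the remaining weights. This is not the authors' fairseq implementation; the simple int8-style scalar quantizer, the helper name quant_noise_forward, and the toy shapes are illustrative assumptions standing in for the product quantization used in the paper.

```python
# Sketch of per-forward quantization noise (assumed stand-in for the paper's method).
import torch


def quant_noise_forward(weight: torch.Tensor, p: float = 0.05, block_size: int = 8) -> torch.Tensor:
    """Return a weight tensor where a random fraction `p` of blocks is quantized.

    Quantized blocks use a straight-through estimator (STE); untouched blocks keep
    their exact values, so their gradients are unbiased.
    """
    out_features, in_features = weight.shape
    assert in_features % block_size == 0, "in_features must be divisible by block_size"
    num_blocks = in_features // block_size

    # Pick which blocks to quantize on this forward pass.
    mask = torch.rand(out_features, num_blocks, device=weight.device) < p
    mask = mask.repeat_interleave(block_size, dim=1).to(weight.dtype)

    # Stand-in quantizer: symmetric int8-style rounding (the paper targets PQ/iPQ).
    scale = weight.abs().max().clamp(min=1e-8) / 127.0
    w_q = torch.round(weight / scale) * scale

    # STE: forward uses quantized values on the masked blocks, backward passes
    # gradients through as if no rounding had happened.
    w_ste = weight + (w_q - weight).detach()
    return mask * w_ste + (1.0 - mask) * weight


# Example: apply the noise to a linear layer's weight during training.
layer = torch.nn.Linear(16, 4, bias=False)
x = torch.randn(2, 16)
y = torch.nn.functional.linear(x, quant_noise_forward(layer.weight, p=0.05, block_size=8))
y.sum().backward()  # gradients reach all of layer.weight
```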
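
The optimization settings quoted under 'Experiment Setup' (cosine learning-rate schedule, Nesterov momentum of 0.99, gradient renormalization when the norm exceeds 0.1) map onto standard PyTorch components roughly as below. This is a hedged sketch: the model, loss, base learning rate, and number of steps are placeholders, not values taken from the paper.

```python
# Illustrative training loop with the quoted optimizer settings (placeholders elsewhere).
import torch

model = torch.nn.Linear(16, 4)           # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.99, nesterov=True)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

for step in range(1000):
    x = torch.randn(32, 16)              # placeholder batch
    loss = model(x).pow(2).mean()        # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    # Renormalize gradients if their norm exceeds 0.1, as stated in the setup.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
    optimizer.step()
    scheduler.step()
```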