Training with Quantization Noise for Extreme Model Compression

Authors: Pierre Stock, Angela Fan, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou, Armand Joulin

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training (Jacob et al., 2018), where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator (Bengio et al., 2013). In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods where the approximations introduced by STE are severe, such as Product Quantization. Our proposal is to only quantize a different random subset of weights during each forward, allowing for unbiased gradients to flow through the other weights. Controlling the amount of noise and its form allows for extreme compression rates while maintaining the performance of the original model. As a result we establish new state-of-the-art compromises between accuracy and model size both in natural language processing and image classification. For example, applying our method to state-of-the-art Transformer and ConvNet architectures, we can achieve 82.5% accuracy on MNLI by compressing RoBERTa to 14 MB and 80.0% top-1 accuracy on ImageNet by compressing an EfficientNet-B3 to 3.3 MB.
Researcher Affiliation | Collaboration | Pierre Stock (Facebook AI Research, Inria); Angela Fan (Facebook AI Research, LORIA); Benjamin Graham (Facebook AI Research); Edouard Grave (Facebook AI Research); Rémi Gribonval (Inria); Hervé Jégou (Facebook AI Research); Armand Joulin (Facebook AI Research)
Pseudocode | No | The paper describes the proposed method in detail in Section 4 'QUANT-NOISE FOR TRAINING COMPACT MODELS' but does not include formal pseudocode or an algorithm block.
Open Source Code | Yes | Code available at https://github.com/pytorch/fairseq/tree/master/examples/quant_noise
Open Datasets | Yes | We experiment on the Wikitext-103 benchmark (Merity et al., 2016) that contains 100M tokens and a vocabulary of 260k words. We pre-train the base BERT model (Devlin et al., 2018) on the BooksCorpus + Wiki dataset with a LayerDrop rate of 0.2. We finetune the pre-trained models on the MNLI task (Williams et al., 2018) from the GLUE Benchmark (Wang et al., 2019) and report accuracy. We train an EfficientNet-B3 model (Tan & Le, 2019) on the ImageNet object classification benchmark (Deng et al., 2009).
Dataset Splits | Yes | For image classification, we train an EfficientNet-B3 on the ImageNet-1k benchmark, report top-1 accuracy on the validation set, and use our re-implementation of EfficientNet-B3.
Hardware Specification | No | The paper mentions 'accelerating inference on supporting hardware' and 'on dedicated hardware' but does not specify any particular CPU, GPU, or other hardware used for running the experiments.
Software Dependencies | No | Our models are implemented in PyTorch (Paszke et al., 2017). We use fairseq (Ott et al., 2019) for language modeling and pre-training for sentence representation tasks, and Classy Vision (Adcock et al., 2019) for EfficientNet. While software packages are named, specific version numbers are not provided for PyTorch, fairseq, or Classy Vision.
Experiment Setup | Yes | For both settings, we report model size in megabytes (MB) and the compression ratio compared to the original model. Our quantization noise framework is general and flexible. Quant-Noise improves the performance of quantized models for every quantization scheme in both experimental settings. Importantly, Quant-Noise only changes model training by adding a regularization noise similar to dropout, with no impact on convergence and very limited impact on training speed (< 5% slower). We use a cosine learning rate schedule (Baevski & Auli, 2018; Loshchilov & Hutter, 2016) and train with Nesterov's accelerated gradient (Sutskever et al., 2013). We set the momentum to 0.99 and renormalize gradients if the norm exceeds 0.1 (Pascanu et al., 2014). We set LayerDrop to 0.2. We set the Quant-Noise value to 0.05. During training, we searched over the values (0.05, 0.1, 0.2) to determine the optimal Quant-Noise rate. During training, the block size of Quant-Noise is 8.
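
To make the method described under 'Research Type' above more concrete, here is a minimal sketch of the quantization-noise idea: on each forward pass, only a random subset of weight blocks is quantized (through a straight-through estimator), so gradients flow unbiased through the remaining weights. This is not the authors' fairseq implementation; the simple int8-style scalar quantizer, the helper name quant_noise_forward, and the toy shapes are illustrative assumptions standing in for the product quantization used in the paper.

```python
# Sketch of per-forward quantization noise (assumed stand-in for the paper's method).
import torch


def quant_noise_forward(weight: torch.Tensor, p: float = 0.05, block_size: int = 8) -> torch.Tensor:
    """Return a weight tensor where a random fraction `p` of blocks is quantized.

    Quantized blocks use a straight-through estimator (STE); untouched blocks keep
    their exact values, so their gradients are unbiased.
    """
    out_features, in_features = weight.shape
    assert in_features % block_size == 0, "in_features must be divisible by block_size"
    num_blocks = in_features // block_size

    # Pick which blocks to quantize on this forward pass.
    mask = torch.rand(out_features, num_blocks, device=weight.device) < p
    mask = mask.repeat_interleave(block_size, dim=1).to(weight.dtype)

    # Stand-in quantizer: symmetric int8-style rounding (the paper targets PQ/iPQ).
    scale = weight.abs().max().clamp(min=1e-8) / 127.0
    w_q = torch.round(weight / scale) * scale

    # STE: forward uses quantized values on the masked blocks, backward passes
    # gradients through as if no rounding had happened.
    w_ste = weight + (w_q - weight).detach()
    return mask * w_ste + (1.0 - mask) * weight


# Example: apply the noise to a linear layer's weight during training.
layer = torch.nn.Linear(16, 4, bias=False)
x = torch.randn(2, 16)
y = torch.nn.functional.linear(x, quant_noise_forward(layer.weight, p=0.05, block_size=8))
y.sum().backward()  # gradients reach all of layer.weight
```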
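
The optimization settings quoted under 'Experiment Setup' (cosine learning-rate schedule, Nesterov momentum of 0.99, gradient renormalization when the norm exceeds 0.1) map onto standard PyTorch components roughly as below. This is a hedged sketch: the model, loss, base learning rate, and number of steps are placeholders, not values taken from the paper.

```python
# Illustrative training loop with the quoted optimizer settings (placeholders elsewhere).
import torch

model = torch.nn.Linear(16, 4)           # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.99, nesterov=True)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

for step in range(1000):
    x = torch.randn(32, 16)              # placeholder batch
    loss = model(x).pow(2).mean()        # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    # Renormalize gradients if their norm exceeds 0.1, as stated in the setup.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
    optimizer.step()
    scheduler.step()
```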