GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Authors: Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show empirically it is possible to perform inference in LLMs with up to 175B parameters without any performance degradation. This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We measure the robustness of quantization methods as we scale the size of several publicly available pretrained language models up to 175B parameters.
Researcher Affiliation | Collaboration | University of Washington, Facebook AI Research, Hugging Face, ENS Paris-Saclay
Pseudocode | Yes | Figure 2: Schematic of LLM.int8(). Given 16-bit floating-point inputs Xf16 and weights Wf16, the features and weights are decomposed into sub-matrices of large magnitude features and other values. The outlier feature matrices are multiplied in 16-bit. All other values are multiplied in 8-bit. We perform 8-bit vector-wise multiplication by scaling by row and column-wise absolute maximum of Cx and Cw and then quantizing the outputs to Int8. (A NumPy sketch of this decomposition follows the table.)
Open Source Code | Yes | We open-source our software and release a Hugging Face Transformers (Wolf et al., 2019) integration making our method available to all hosted Hugging Face Models that have linear layers. https://github.com/TimDettmers/bitsandbytes (A usage sketch follows the table.)
Open Datasets | Yes | For the language modeling setup, we use dense autoregressive transformers pretrained in fairseq (Ott et al., 2019) ranging between 125M and 13B parameters. ... To evaluate the language modeling degradation after Int8 quantization, we evaluate the perplexity of the 8-bit transformer on validation data of the C4 corpus (Raffel et al., 2019) which is a subset of the Common Crawl corpus. ... To measure degradation in zeroshot performance, we use OPT models (Zhang et al., 2022), and we evaluate these models on the EleutherAI language model evaluation harness (Gao et al., 2021).
Dataset Splits | Yes | To evaluate the language modeling degradation after Int8 quantization, we evaluate the perplexity of the 8-bit transformer on validation data of the C4 corpus (Raffel et al., 2019) which is a subset of the Common Crawl corpus. (See the perplexity loop sketched below the table.)
Hardware Specification | Yes | We use NVIDIA A40 GPUs for this evaluation. Enterprise: 8x A100 80 GB (OPT-175B / BLOOM); academic server: 8x RTX 3090 24 GB (OPT-175B / BLOOM). (See the memory check below the table.)
Software Dependencies | No | The paper mentions software frameworks such as Hugging Face Transformers, fairseq, and Tensorflow-Mesh, but it does not specify exact version numbers for these or any other software dependencies.
Experiment Setup | Yes | In our work, we find that α = 6.0 is sufficient to reduce transformer performance degradation close to zero. We use two setups for our experiments. One is based on language modeling perplexity, which we find to be a highly robust measure that is very sensitive to quantization degradation. We use this setup to compare different quantization baselines. Additionally, we evaluate zeroshot accuracy degradation on OPT models for a range of different end tasks, where we compare our methods with a 16-bit baseline. (See the baseline comparison below the table.)
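
The sketches below expand on several rows of the table. They are illustrative reconstructions under stated assumptions, not the authors' released implementation.

First, a minimal NumPy sketch of the mixed-precision decomposition summarized in the Figure 2 caption: hidden dimensions containing an outlier feature (magnitude at least α) are multiplied in 16-bit, while the remaining values go through vector-wise Int8 quantization with row scales Cx and column scales Cw. The function name and the pure-NumPy formulation are ours; the released bitsandbytes code implements fused GPU kernels for these steps.

```python
import numpy as np

ALPHA = 6.0  # outlier threshold; the paper reports alpha = 6.0 is sufficient

def int8_matmul_with_outliers(X_f16, W_f16, alpha=ALPHA):
    """Approximate X_f16 @ W_f16 with an LLM.int8()-style decomposition.
    X_f16: (tokens, hidden) activations, W_f16: (hidden, out) weights."""
    # 1. Outlier feature dimensions: hidden columns of X with any |value| >= alpha.
    outlier_cols = np.where(np.abs(X_f16).max(axis=0) >= alpha)[0]
    regular_cols = np.setdiff1d(np.arange(X_f16.shape[1]), outlier_cols)

    # 2. The outlier sub-matrices are multiplied in 16-bit precision.
    out16 = X_f16[:, outlier_cols].astype(np.float32) @ W_f16[outlier_cols, :].astype(np.float32)

    # 3. Everything else: vector-wise quantization. Scale each row of X by its
    #    absolute maximum (Cx) and each column of W by its absolute maximum (Cw),
    #    quantize to Int8, multiply with Int32 accumulation, then rescale.
    Xs = X_f16[:, regular_cols].astype(np.float32)
    Ws = W_f16[regular_cols, :].astype(np.float32)
    Cx = np.abs(Xs).max(axis=1, keepdims=True) + 1e-8   # (tokens, 1) row scales
    Cw = np.abs(Ws).max(axis=0, keepdims=True) + 1e-8   # (1, out) column scales
    Xq = np.clip(np.round(127.0 * Xs / Cx), -127, 127).astype(np.int8)
    Wq = np.clip(np.round(127.0 * Ws / Cw), -127, 127).astype(np.int8)
    out8 = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (Cx * Cw) / (127.0 * 127.0)

    # 4. Accumulate both parts and return the result in 16-bit.
    return (out16 + out8).astype(np.float16)
```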
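
For the Open Source Code row, a hedged usage sketch of the Hugging Face Transformers integration. It assumes transformers, accelerate, and bitsandbytes are installed; `load_in_8bit=True` is how the integration was exposed when it was released (later transformers versions moved this into a quantization config object), and the checkpoint name is only an example.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"  # example checkpoint; any hosted causal LM with linear layers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",    # let accelerate place layers on the available GPUs
    load_in_8bit=True,    # swap nn.Linear for bitsandbytes Int8 linear layers
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```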
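
For the Dataset Splits row, an illustrative perplexity loop over C4 validation data. It stands in for the paper's fairseq evaluation pipeline and assumes the Hugging Face `datasets` copy of C4 (`allenai/c4`, English config); the document count and sequence length are arbitrary choices for a quick check.

```python
import math
import torch
from datasets import load_dataset

def c4_validation_perplexity(model, tokenizer, n_docs=200, max_len=1024):
    """Token-level perplexity on a slice of the C4 validation split (illustrative)."""
    data = load_dataset("allenai/c4", "en", split="validation", streaming=True)
    nll_sum, n_tokens = 0.0, 0
    for i, doc in enumerate(data):
        if i >= n_docs:
            break
        ids = tokenizer(doc["text"], return_tensors="pt",
                        truncation=True, max_length=max_len).input_ids.to(model.device)
        if ids.shape[1] < 2:
            continue  # need at least one predicted token
        with torch.no_grad():
            loss = model(ids, labels=ids).loss  # mean NLL over the predicted tokens
        n_pred = ids.shape[1] - 1
        nll_sum += loss.item() * n_pred
        n_tokens += n_pred
    return math.exp(nll_sum / n_tokens)
```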
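
For the Hardware Specification row, a back-of-the-envelope check of why the listed machines suffice. It counts weight memory only and ignores activations, the 16-bit outlier path, and framework overhead, so the numbers are lower bounds.

```python
params = 175e9                      # OPT-175B / BLOOM parameter count (approx.)
fp16_weights_gb = params * 2 / 1e9  # ~350 GB: more than 8 x RTX 3090 (8 * 24 = 192 GB)
int8_weights_gb = params * 1 / 1e9  # ~175 GB: fits 8 x RTX 3090, easily 8 x A100 80 GB (640 GB)
print(f"fp16 weights: ~{fp16_weights_gb:.0f} GB, int8 weights: ~{int8_weights_gb:.0f} GB")
```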
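
Finally, for the Experiment Setup row, a sketch of the perplexity-based comparison against a 16-bit baseline, reusing the hypothetical `c4_validation_perplexity` helper above. A small OPT checkpoint is used so the comparison runs on a single GPU; the paper's experiments sweep model sizes up to 175B parameters.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/opt-1.3b"
tok = AutoTokenizer.from_pretrained(name)

# 16-bit baseline vs. Int8-quantized model of the same checkpoint
baseline = AutoModelForCausalLM.from_pretrained(name, torch_dtype="auto", device_map="auto")
quantized = AutoModelForCausalLM.from_pretrained(name, load_in_8bit=True, device_map="auto")

ppl_16bit = c4_validation_perplexity(baseline, tok)
ppl_int8 = c4_validation_perplexity(quantized, tok)
print(f"16-bit perplexity: {ppl_16bit:.3f}   Int8 perplexity: {ppl_int8:.3f}")
```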