GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Authors: Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show empirically it is possible to perform inference in LLMs with up to 175B parameters without any performance degradation. This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We measure the robustness of quantization methods as we scale the size of several publicly available pretrained language models up to 175B parameters. |
| Researcher Affiliation | Collaboration | University of Washington, Facebook AI Research, Hugging Face, ENS Paris-Saclay |
| Pseudocode | Yes | Figure 2: Schematic of LLM.int8(). Given 16-bit floating-point inputs Xf16 and weights Wf16, the features and weights are decomposed into sub-matrices of large magnitude features and other values. The outlier feature matrices are multiplied in 16-bit. All other values are multiplied in 8-bit. We perform 8-bit vector-wise multiplication by scaling by row and column-wise absolute maximum of Cx and Cw and then quantizing the outputs to Int8. (A minimal sketch of this decomposition is given below the table.) |
| Open Source Code | Yes | We open-source our software and release a Hugging Face Transformers (Wolf et al., 2019) integration making our method available to all hosted Hugging Face Models that have linear layers. https://github.com/TimDettmers/bitsandbytes (A usage sketch of the Transformers integration is given below the table.) |
| Open Datasets | Yes | For the language modeling setup, we use dense autoregressive transformers pretrained in fairseq (Ott et al., 2019) ranging between 125M and 13B parameters. ... To evaluate the language modeling degradation after Int8 quantization, we evaluate the perplexity of the 8-bit transformer on validation data of the C4 corpus (Raffel et al., 2019) which is a subset of the Common Crawl corpus. ... To measure degradation in zero-shot performance, we use OPT models (Zhang et al., 2022), and we evaluate these models on the Eleuther AI language model evaluation harness (Gao et al., 2021). (A rough perplexity-evaluation sketch is given below the table.) |
| Dataset Splits | Yes | To evaluate the language modeling degradation after Int8 quantization, we evaluate the perplexity of the 8-bit transformer on validation data of the C4 corpus (Raffel et al., 2019) which is a subset of the Common Crawl corpus. |
| Hardware Specification | Yes | We use NVIDIA A40 GPUs for this evaluation. From the paper's hardware table: enterprise server with 8x A100 80 GB (OPT-175B / BLOOM); academic server with 8x RTX 3090 24 GB (OPT-175B / BLOOM). |
| Software Dependencies | No | The paper mentions software frameworks like 'Hugging Face Transformers' and 'fairseq' and 'Tensorflow-Mesh', but it does not specify exact version numbers for these or other software dependencies. |
| Experiment Setup | Yes | In our work, we find that α = 6.0 is sufficient to reduce transformer performance degradation close to zero. We use two setups for our experiments. One is based on language modeling perplexity, which we find to be a highly robust measure that is very sensitive to quantization degradation. We use this setup to compare different quantization baselines. Additionally, we evaluate zero-shot accuracy degradation on OPT models for a range of different end tasks, where we compare our methods with a 16-bit baseline. |
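
To make the Figure 2 description concrete, the following is a minimal PyTorch sketch of the mixed-precision decomposition, assuming dense activations X (tokens × hidden) and weights W (hidden × out). The function name, the dtype handling, and the dense indexing are our own simplifications, not the paper's CUDA kernels; the default threshold mirrors the α = 6.0 value quoted in the Experiment Setup row.

```python
import torch

def mixed_precision_matmul(X: torch.Tensor, W: torch.Tensor, threshold: float = 6.0) -> torch.Tensor:
    """Toy version of the decomposition in the Figure 2 caption (not the paper's CUDA kernels).

    X: (tokens, hidden) activations, W: (hidden, out) weights.
    Hidden dimensions whose activation magnitude exceeds `threshold`
    (alpha = 6.0 in the paper) are multiplied in full precision; the
    remaining dimensions go through vector-wise Int8 quantization.
    """
    outlier = X.abs().amax(dim=0) > threshold   # boolean mask over hidden dimensions
    regular = ~outlier

    # Outlier features: ordinary high-precision matmul (16-bit in the paper).
    out_hi = X[:, outlier] @ W[outlier, :]

    # Regular features: vector-wise quantization with row/column absmax constants.
    Xs, Ws = X[:, regular], W[regular, :]
    c_x = Xs.abs().amax(dim=1, keepdim=True).clamp_min(1e-8)   # one constant per row of X
    c_w = Ws.abs().amax(dim=0, keepdim=True).clamp_min(1e-8)   # one constant per column of W
    X_i8 = torch.round(127 * Xs / c_x).to(torch.int8)
    W_i8 = torch.round(127 * Ws / c_w).to(torch.int8)

    # Int32 accumulation of the Int8 product, then dequantization with the
    # outer product of the scaling constants.
    acc = X_i8.to(torch.int32) @ W_i8.to(torch.int32)
    out_lo = acc.to(X.dtype) * (c_x * c_w) / (127 * 127)

    return out_hi + out_lo
```

On random float inputs with a few artificially scaled-up columns, comparing the result against a plain `X @ W` shows small quantization error on the regular features while the outlier columns pass through at full precision, which is the point of the decomposition.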
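The Open Source Code row points to the bitsandbytes repository and the Hugging Face Transformers integration. Below is a sketch of how that integration is typically invoked; the exact keyword arguments have shifted across transformers releases (newer versions wrap the flag in a BitsAndBytesConfig), and the OPT-1.3B checkpoint and prompt are illustrative choices of ours, not the paper's.

```python
# Requires: pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # any hosted causal LM with linear layers works

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # let accelerate place layers on the available GPUs
    load_in_8bit=True,   # replace nn.Linear with the bitsandbytes 8-bit layer
)

inputs = tokenizer("Quantization makes large models", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```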
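For the C4 validation perplexity measurement referenced in the Open Datasets and Dataset Splits rows, a rough evaluation loop might look like the following. It streams the allenai/c4 validation split via the datasets library and uses simple per-document truncation; the function name, the `num_docs` cap, and the truncation scheme are assumptions of ours rather than the paper's exact protocol.

```python
# A rough perplexity loop on C4 validation data (not the paper's exact protocol).
import math
import torch
from datasets import load_dataset

def c4_validation_perplexity(model, tokenizer, num_docs=200, max_length=1024):
    # Stream the validation split so the full corpus never has to be downloaded.
    stream = load_dataset("allenai/c4", "en", split="validation", streaming=True)
    nll, n_tokens = 0.0, 0
    model.eval()
    for i, doc in enumerate(stream):
        if i >= num_docs:
            break
        enc = tokenizer(doc["text"], return_tensors="pt",
                        truncation=True, max_length=max_length).to(model.device)
        if enc["input_ids"].shape[1] < 2:
            continue
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        # out.loss is the mean NLL per predicted token; re-weight by token count.
        n = enc["input_ids"].shape[1] - 1
        nll += out.loss.item() * n
        n_tokens += n
    return math.exp(nll / n_tokens)
```

Running this once on a 16-bit model and once on the same model loaded with `load_in_8bit=True` gives the kind of before/after perplexity comparison the paper uses to quantify quantization degradation.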