The case for 4-bit precision: k-bit Inference Scaling Laws
Authors: Tim Dettmers, Luke Zettlemoyer
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We run more than 35,000 experiments with 16-bit inputs and k-bit parameters to examine which zero-shot quantization methods improve scaling for 3 to 8-bit precision at scales of 19M to 176B parameters across the LLM families BLOOM, OPT, NeoX/Pythia, and GPT-2. |
| Researcher Affiliation | Academia | University of Washington. |
| Pseudocode | No | The paper includes mathematical equations (numbered 1–8) but does not present any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the methodology described is publicly available. |
| Open Datasets | Yes | To measure inference performance for k-bit quantization methods, we use perplexity on the Common Crawl subset of The Pile (Gao et al., 2020) and mean zero-shot performance on the Eleuther AI LM Evaluation harness (Gao et al., 2021). |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits for reproducibility of training models, as it primarily evaluates pre-trained Large Language Models. |
| Hardware Specification | Yes | This occurs if the inference batch size is below 60 or 200 for an RTX 3090 or RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions 'CUDA kernels' but does not specify version numbers for programming languages, libraries, or other software components used in their experiments. |
| Experiment Setup | Yes | In our experiments, we use 16-bit inputs and k-bit quantized parameters for 3 ≤ k ≤ 8. Attention matrices are not quantized since they do not contain parameters. We also use a 16-bit baseline that does not use any quantization (16-bit floats). (See the illustrative sketch after this table.) |
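
To make the setup in the Experiment Setup row concrete, the sketch below shows generic blockwise absmax quantization: weights are stored as signed k-bit integer codes with one scale per block and dequantized to 16-bit at inference time, matching the "16-bit inputs and k-bit quantized parameters" regime described in the paper. This is an illustrative sketch only, not the authors' released kernels; the block size of 64, the symmetric round-to-nearest scheme, and the function names are assumptions.

```python
import torch

def quantize_blockwise(W: torch.Tensor, k: int = 4, block: int = 64):
    """Quantize a weight tensor to signed k-bit integer codes with one absmax scale per block."""
    qmax = 2 ** (k - 1) - 1                        # symmetric range, e.g. 7 for k = 4
    flat = W.flatten().float()
    pad = (-flat.numel()) % block                  # pad so the tensor splits evenly into blocks
    flat = torch.nn.functional.pad(flat, (0, pad))
    blocks = flat.view(-1, block)
    scale = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / qmax
    codes = torch.clamp(torch.round(blocks / scale), -qmax, qmax).to(torch.int8)
    return codes, scale, W.shape, pad

def dequantize_blockwise(codes, scale, shape, pad):
    """Reconstruct a float16 weight tensor from the k-bit codes and per-block scales."""
    flat = (codes.float() * scale).flatten()
    if pad:
        flat = flat[:-pad]
    return flat.view(shape).to(torch.float16)

# Round trip at k = 4: the dequantized 16-bit weights stand in for the original
# parameters during inference with 16-bit inputs.
W = torch.randn(256, 256)
codes, scale, shape, pad = quantize_blockwise(W, k=4, block=64)
W_deq = dequantize_blockwise(codes, scale, shape, pad)
print("mean absolute quantization error:", (W - W_deq.float()).abs().mean().item())
```

The storage cost per parameter is k bits plus the amortized cost of one scale per block, which is why smaller block sizes trade extra memory for lower quantization error; the paper's scaling analysis compares such k-bit configurations against the unquantized 16-bit baseline.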