Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
Authors: Yeonhong Park, Jake Hyun, Sanglyul Cho, Bonggeun Sim, Jae W. Lee
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experimental studies demonstrate that our solution is a powerful approach for the deployment of multiple, different-sized LLMs, achieving the following results: ...Our solution efficiently packs LLMs quantized to varying bit-widths, such as 3, 4, ... up to n bits, into a memory footprint comparable to a single n-bit LLM. Our solution yields a set of quantized LLMs of varying bit-widths that, while offering any-precision support, match the quality of the state-of-the-art quantization techniques at each bit-width. Our solution, despite having to adopt a bit-interleaved (bitplane) memory layout for the support of any-precision, showcases high inference throughput, matching or even outperforming that of state-of-the-art quantized matrix-vector multiplication engines that do not support any-precision (Kim et al., 2023b). (A minimal bitplane sketch appears after the table.) |
| Researcher Affiliation | Academia | Yeonhong Park, Jake Hyun, Sanglyul Cho, Bonggeun Sim, Jae W. Lee (Seoul National University). Correspondence to: Jae W. Lee <jaewlee@snu.ac.kr>. |
| Pseudocode | Yes | Algorithm 1 presents a modified version of GPTQ that additionally includes a clamping operation to preserve the essential weight-inheriting characteristic of the upscaling process. ... Algorithm 1 Incremental Upscaling of GPTQ. (An illustrative clamping sketch appears after the table.) |
| Open Source Code | Yes | The code is available at https://github.com/SNU-ARC/any-precision-llm. |
| Open Datasets | Yes | We evaluate the models with two metrics: perplexity on three datasets (WikiText2 (Merity et al., 2016), PTB (Marcus et al., 1994), C4 (Raffel et al., 2023)) and zero-shot accuracy on five tasks (ARC-easy/challenge (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Tata & Patel, 2003), WinoGrande (Sakaguchi et al., 2021)). |
| Dataset Splits | Yes | For C4, we concatenate samples from the validation set, as using the whole unsampled dataset is infeasible and impractical due to the large size of the dataset. |
| Hardware Specification | Yes | We conduct experiments on three GPUs of varying scales: RTX 4090 (desktop), RTX 4070 Laptop (laptop), and Jetson AGX Orin 64 GB (mobile). ... We measure the runtime of the any-precision quantization process, beginning with a 3-bit seed model and progressing up to the final 8-bit parent model, on an Intel i9-13900K CPU with 24 cores. |
| Software Dependencies | No | The paper mentions software tools like 'cuBLAS' and 'TensorRT-LLM (NVIDIA)' but does not provide specific version numbers for these or other software dependencies required for reproduction. |
| Experiment Setup | Yes | We evaluate 4 to 8-bit models obtained through incremental upscaling, using a 3-bit SqueezeLLM model as the seed model. ... We benchmark our method on LLaMA-2-7B (Touvron et al., 2023), Mistral-7B (Jiang et al., 2023), and three OPT models (6.7B, 2.7B, 1.3B) (Zhang et al., 2022). We evaluate the models with two metrics: perplexity on three datasets (WikiText2 (Merity et al., 2016), PTB (Marcus et al., 1994), C4 (Raffel et al., 2023)) and zero-shot accuracy on five tasks (ARC-easy/challenge (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Tata & Patel, 2003), WinoGrande (Sakaguchi et al., 2021)). (An illustrative perplexity-measurement sketch appears after the table.) |
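
The bit-interleaved (bitplane) memory layout cited in the Research Type row is what lets a single n-bit parent model serve every lower bit-width from one memory footprint. Below is a minimal sketch of that idea, assuming each weight is stored as an n-bit quantization index whose bits are split across n planes; the function names and the plain 8-bit index format are illustrative assumptions, not the authors' CUDA kernels.

```python
# Minimal sketch of a bit-interleaved (bitplane) weight layout.
# Assumption: each weight is an n-bit quantization index; a k-bit child model
# is obtained by reading only the top-k planes of the n-bit parent.
import numpy as np

def pack_bitplanes(indices: np.ndarray, n_bits: int = 8) -> np.ndarray:
    """Split n-bit indices into n bitplanes, most significant plane first."""
    planes = [((indices >> (n_bits - 1 - b)) & 1) for b in range(n_bits)]
    return np.stack(planes).astype(np.uint8)

def read_k_bit_indices(planes: np.ndarray, k: int) -> np.ndarray:
    """Rebuild k-bit indices by reading only the top-k planes of the parent."""
    idx = np.zeros(planes.shape[1:], dtype=np.uint8)
    for b in range(k):
        idx = (idx << 1) | planes[b]
    return idx

# Usage: an 8-bit parent yields a 4-bit child by truncating the index prefix,
# so all bit-widths share the parent's storage.
parent = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)
planes = pack_bitplanes(parent, n_bits=8)
child_4bit = read_k_bit_indices(planes, k=4)
assert np.array_equal(child_4bit, parent >> 4)  # child inherits the prefix bits
```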
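Algorithm 1 (Incremental Upscaling of GPTQ) is described as GPTQ plus a clamping operation that preserves weight inheritance between bit-widths. The sketch below illustrates only the clamping idea under the assumption of index-level quantization, where a k-bit index q may only be refined to child indices 2q or 2q + 1 at k+1 bits; it is not the paper's algorithm, which applies this constraint inside GPTQ's column-by-column update.

```python
# Hedged sketch of the clamping step used during incremental upscaling.
# Names and the uniform-grid assumption are illustrative, not the authors' code.
import numpy as np

def upscale_indices_with_clamp(parent_idx: np.ndarray,
                               weights: np.ndarray,
                               child_grid: np.ndarray) -> np.ndarray:
    """Assign (k+1)-bit indices that keep the k-bit parent index as a prefix.

    parent_idx: k-bit indices already assigned by the parent model.
    weights:    full-precision weights being re-quantized at k+1 bits.
    child_grid: representative values of the (k+1)-bit grid, shape (2**(k+1),).
    """
    lo = 2 * parent_idx.astype(np.int64)      # child index 2q   (appends bit 0)
    hi = lo + 1                               # child index 2q+1 (appends bit 1)
    # Clamp the choice to the two children of the parent's index and pick
    # whichever representative value is closer to the target weight.
    pick_hi = np.abs(child_grid[hi] - weights) < np.abs(child_grid[lo] - weights)
    return np.where(pick_hi, hi, lo)
```

Restricting each weight to the two children of its parent index is what keeps every lower-precision model embedded bit-for-bit inside the higher-precision one.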
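The perplexity protocol (concatenated C4 validation samples scored in fixed-length windows) can be approximated along the following lines. This is a hedged sketch using Hugging Face `transformers` and `datasets`; the 2048-token window, the number of concatenated samples, and the model identifier are assumptions rather than values taken from the paper's evaluation scripts.

```python
# Sketch of perplexity measurement on concatenated C4 validation samples.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

def c4_val_perplexity(model_name: str, seq_len: int = 2048, n_chunks: int = 128) -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16
    ).cuda().eval()

    # Concatenate a bounded number of validation samples instead of the full corpus.
    val = load_dataset("allenai/c4", "en", split="validation", streaming=True)
    text = "\n\n".join(row["text"] for _, row in zip(range(2000), val))
    ids = tok(text, return_tensors="pt").input_ids.cuda()

    nlls = []
    for i in range(min(n_chunks, ids.shape[1] // seq_len)):
        chunk = ids[:, i * seq_len:(i + 1) * seq_len]
        with torch.no_grad():
            # The causal-LM loss is the mean token negative log-likelihood.
            loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float() * seq_len)
    return torch.exp(torch.stack(nlls).sum() / (len(nlls) * seq_len)).item()

# Usage (model id is an assumption): print(c4_val_perplexity("meta-llama/Llama-2-7b-hf"))
```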