Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs

Authors: Yeonhong Park, Jake Hyun, Sanglyul Cho, Bonggeun Sim, Jae W. Lee

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experimental studies demonstrate that our solution is a powerful approach for the deployment of multiple, different-sized LLMs, achieving the following results: ... Our solution efficiently packs LLMs quantized to varying bit-widths, such as 3, 4, ... up to n bits, into a memory footprint comparable to a single n-bit LLM. Our solution yields a set of quantized LLMs of varying bit-widths that, while offering any-precision support, match the quality of the state-of-the-art quantization techniques at each bit-width. Our solution, despite having to adopt a bit-interleaved (bitplane) memory layout for the support of any-precision, showcases high inference throughput, matching or even outperforming that of state-of-the-art quantized matrix-vector multiplication engines that do not support any-precision (Kim et al., 2023b). [See the bitplane layout sketch below the table.]
Researcher Affiliation | Academia | Yeonhong Park¹, Jake Hyun¹, Sang Lyul Cho¹, Bonggeun Sim¹, Jae W. Lee¹. ¹Seoul National University. Correspondence to: Jae W. Lee <jaewlee@snu.ac.kr>.
Pseudocode | Yes | Algorithm 1 presents a modified version of GPTQ that additionally includes a clamping operation to preserve the essential weight-inheriting characteristic of the upscaling process. ... Algorithm 1: Incremental Upscaling of GPTQ. [See the clamping sketch below the table.]
Open Source Code | Yes | The code is available at https://github.com/SNU-ARC/any-precision-llm.
Open Datasets | Yes | We evaluate the models with two metrics: perplexity on three datasets (WikiText2 (Merity et al., 2016), PTB (Marcus et al., 1994), C4 (Raffel et al., 2023)) and zero-shot accuracy on five tasks (ARC-easy/challenge (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Tata & Patel, 2003), WinoGrande (Sakaguchi et al., 2021)).
Dataset Splits | Yes | For C4, we concatenate samples from the validation set, as using the whole unsampled dataset is infeasible and impractical due to the large size of the dataset. [See the C4 preparation sketch below the table.]
Hardware Specification | Yes | We conduct experiments on three GPUs of varying scales: RTX 4090 (desktop), RTX 4070 Laptop (laptop), and Jetson AGX Orin 64 GB (mobile). ... We measure the runtime of the any-precision quantization process, beginning with a 3-bit seed model and progressing up to the final 8-bit parent model, on an Intel i9-13900K CPU with 24 cores.
Software Dependencies | No | The paper mentions software tools like 'cuBLAS' and 'TensorRT-LLM (NVIDIA)' but does not provide specific version numbers for these or other software dependencies required for reproduction.
Experiment Setup | Yes | We evaluate 4 to 8-bit models obtained through incremental upscaling, using a 3-bit SqueezeLLM model as the seed model. ... We benchmark our method on LLaMA-2-7B (Touvron et al., 2023), Mistral-7B (Jiang et al., 2023), and three OPT models (6.7B, 2.7B, 1.3B) (Zhang et al., 2022). We evaluate the models with two metrics: perplexity on three datasets (WikiText2 (Merity et al., 2016), PTB (Marcus et al., 1994), C4 (Raffel et al., 2023)) and zero-shot accuracy on five tasks (ARC-easy/challenge (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Tata & Patel, 2003), WinoGrande (Sakaguchi et al., 2021)).
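
Bitplane layout sketch. To make the bit-interleaved (bitplane) weight layout quoted in the Research Type row more concrete, the NumPy sketch below illustrates the layout idea: the n-bit parent model's quantization indices are split into n bitplanes, and a k-bit child model is obtained by reading only the k most significant planes. This is an illustration only, not the paper's GPU kernel, which packs bitplanes into machine words and fuses the extraction with the quantized matrix-vector product.

import numpy as np

def pack_bitplanes(q_parent, n_bits):
    # Split n-bit quantization indices into n bitplanes, most significant first.
    planes = [(q_parent >> (n_bits - 1 - b)) & 1 for b in range(n_bits)]
    return np.stack(planes).astype(np.uint8)   # shape: (n_bits, *q_parent.shape)

def read_k_bit(planes, k):
    # Reassemble k-bit indices from the k most significant bitplanes only.
    q_k = np.zeros(planes.shape[1:], dtype=np.int64)
    for b in range(k):
        q_k = (q_k << 1) | planes[b]
    return q_k

# Example: with an 8-bit parent, the 3-bit child is exactly the top-3-bit prefix.
rng = np.random.default_rng(0)
q8 = rng.integers(0, 256, size=(4, 4))
planes = pack_bitplanes(q8, n_bits=8)
assert np.array_equal(read_k_bit(planes, 3), q8 >> 5)

Storing all n planes costs the same memory as a single n-bit model, which is the footprint claim in the quoted passage; the trade-off is that dequantization must gather bits from k separate planes, which is why the paper measures throughput with a custom kernel.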
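
Clamping sketch. The Pseudocode row refers to Algorithm 1, a modified GPTQ with a clamping operation for incremental upscaling. The snippet below is a heavily simplified sketch of the clamping idea only, not a reproduction of Algorithm 1: when upscaling from k to k+1 bits, each weight may only take one of the two child codes 2s and 2s+1 of its k-bit seed code s, so every lower-bit model remains a bit-prefix of the higher-bit one. The function and argument names are illustrative and do not come from the released code; in the paper the clamping is applied inside GPTQ's Hessian-compensated, column-wise quantization loop rather than in a one-shot nearest-centroid pass like this one.

import numpy as np

def clamped_upscale(q_seed, w, centroids_next):
    # q_seed:          k-bit indices per weight, shape (out_ch, in_ch)
    # w:               weights being (re)quantized, shape (out_ch, in_ch)
    # centroids_next:  per-row (k+1)-bit codebook, shape (out_ch, 2**(k+1))
    lo = 2 * q_seed          # the two children of seed code s are 2s and 2s + 1,
    hi = 2 * q_seed + 1      # so the seed code stays the bit-prefix of the result
    c_lo = np.take_along_axis(centroids_next, lo, axis=1)
    c_hi = np.take_along_axis(centroids_next, hi, axis=1)
    return np.where(np.abs(w - c_hi) < np.abs(w - c_lo), hi, lo)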
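
C4 preparation sketch. The Open Datasets and Dataset Splits rows describe the perplexity setup, including building the C4 evaluation data by concatenating samples from the validation split instead of using the full corpus. Below is a hedged sketch of that preparation plus a standard perplexity loop; the Hugging Face dataset identifier, the number of samples drawn, the tokenizer, and the 2048-token segment length are illustrative assumptions, not values taken from the paper or its repository.

import torch
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed tokenizer

# Stream the C4 validation split and keep only a slice of it; as the quoted
# passage notes, the full corpus is far too large to use for evaluation.
val = load_dataset("allenai/c4", "en", split="validation", streaming=True)
texts = [row["text"] for _, row in zip(range(1000), val)]  # sample count is an assumption

# Concatenate the samples into one token stream and cut non-overlapping segments.
ids = tokenizer("\n\n".join(texts), return_tensors="pt").input_ids
seq_len = 2048  # assumed context length
segments = [ids[:, i:i + seq_len]
            for i in range(0, ids.size(1) - seq_len + 1, seq_len)]

@torch.no_grad()
def perplexity(model, segments):
    # Standard causal-LM perplexity: mean next-token loss, exponentiated.
    losses = [model(s.to(model.device), labels=s.to(model.device)).loss
              for s in segments]
    return torch.exp(torch.stack(losses).mean())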