Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
Authors: Yeonhong Park, Jake Hyun, Sanglyul Cho, Bonggeun Sim, Jae W. Lee
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experimental studies demonstrate that our solution is a powerful approach for the deployment of multiple, different-sized LLMs, achieving the following results: ...Our solution efficiently packs LLMs quantized to varying bit-widths, such as 3, 4, ... up to n bits, into a memory footprint comparable to a single n-bit LLM. Our solution yields a set of quantized LLMs of varying bit-widths that, while offering any-precision support, match the quality of the state-of-the-art quantization techniques at each bit-width. Our solution, despite having to adopt a bit-interleaved (bitplane) memory layout for the support of any-precision, showcases high inference throughput, matching or even outperforming that of state-of-the-art quantized matrix-vector multiplication engines that do not support any-precision (Kim et al., 2023b). |
| Researcher Affiliation | Academia | Yeonhong Park¹, Jake Hyun¹, Sang Lyul Cho¹, Bonggeun Sim¹, Jae W. Lee¹ (¹Seoul National University). Correspondence to: Jae W. Lee <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 presents a modified version of GPTQ that additionally includes a clamping operation to preserve the essential weight-inheriting characteristic of the upscaling process. ... Algorithm 1 Incremental Upscaling of GPTQ |
| Open Source Code | Yes | The code is available at https://github.com/SNU-ARC/any-precision-llm. |
| Open Datasets | Yes | We evaluate the models with two metrics: perplexity on three datasets (WikiText2 (Merity et al., 2016), PTB (Marcus et al., 1994), C4 (Raffel et al., 2023)) and zero-shot accuracy on five tasks (ARC-easy/challenge (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Tata & Patel, 2003), WinoGrande (Sakaguchi et al., 2021)). |
| Dataset Splits | Yes | For C4, we concatenate samples from the validation set, as using the whole unsampled dataset is infeasible and impractical due to the large size of the dataset. |
| Hardware Specification | Yes | We conduct experiments on three GPUs of varying scales: RTX 4090 (desktop), RTX 4070 Laptop (laptop), and Jetson AGX Orin 64 GB (mobile). ... We measure the runtime of the any-precision quantization process, beginning with a 3-bit seed model and progressing up to the final 8-bit parent model, on an Intel i9-13900K CPU with 24 cores. |
| Software Dependencies | No | The paper mentions software tools like 'cuBLAS' and 'TensorRT-LLM (NVIDIA)' but does not provide specific version numbers for these or other software dependencies required for reproduction. |
| Experiment Setup | Yes | We evaluate 4 to 8-bit models obtained through incremental upscaling, using a 3-bit SqueezeLLM model as the seed model. ... We benchmark our method on LLaMA-2-7B (Touvron et al., 2023), Mistral-7B (Jiang et al., 2023), and three OPT models (6.7B, 2.7B, 1.3B) (Zhang et al., 2022). We evaluate the models with two metrics: perplexity on three datasets (WikiText2 (Merity et al., 2016), PTB (Marcus et al., 1994), C4 (Raffel et al., 2023)) and zero-shot accuracy on five tasks (ARC-easy/challenge (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Tata & Patel, 2003), WinoGrande (Sakaguchi et al., 2021)). |
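The bitplane memory layout quoted under Research Type can be illustrated with a minimal sketch. It assumes, in line with the weight-inheriting upscaling quoted under Pseudocode, that a k-bit model's weight codes are the k most significant bits of the n-bit parent model's codes. Storing bit i of every weight together in one "plane" then lets a k-bit model be served by reading only the first k planes, so all bit-widths share a single n-bit footprint. The names `pack_bitplanes` and `unpack_k_bits` are hypothetical, and dequantization (mapping codes to weight values) and the GPU kernels are omitted.

```python
"""Hypothetical sketch of a bitplane (bit-interleaved) weight layout.

Assumption: a k-bit child model's code for each weight equals the k most
significant bits of the n-bit parent model's code (weight inheritance).
"""
from typing import List


def pack_bitplanes(codes: List[int], n_bits: int) -> List[int]:
    """Split n-bit unsigned codes into n bitplanes, most significant first.

    planes[i] is an integer whose bit j holds bit (n_bits - 1 - i) of codes[j].
    """
    for c in codes:
        if not 0 <= c < (1 << n_bits):
            raise ValueError(f"code {c} does not fit in {n_bits} bits")
    planes = []
    for i in range(n_bits):
        shift = n_bits - 1 - i
        plane = 0
        for j, c in enumerate(codes):
            plane |= ((c >> shift) & 1) << j
        planes.append(plane)
    return planes


def unpack_k_bits(planes: List[int], k: int, num_weights: int) -> List[int]:
    """Rebuild k-bit codes by reading only the first k (most significant) planes."""
    if not 1 <= k <= len(planes):
        raise ValueError(f"k must be in [1, {len(planes)}]")
    codes = [0] * num_weights
    for plane in planes[:k]:
        for j in range(num_weights):
            # Append the next bit of weight j below the bits read so far.
            codes[j] = (codes[j] << 1) | ((plane >> j) & 1)
    return codes


if __name__ == "__main__":
    parent = [200, 17, 255, 0]  # hypothetical 8-bit parent codes
    planes = pack_bitplanes(parent, n_bits=8)
    for k in (3, 4, 8):
        print(k, unpack_k_bits(planes, k, len(parent)))
    # 3 [6, 0, 7, 0]
    # 4 [12, 1, 15, 0]
    # 8 [200, 17, 255, 0]
```

Every bit-width reads a prefix of the same planes, so storage stays at n bits per weight regardless of how many bit-widths are served. A real kernel would pack each plane into fixed-width machine words per weight row rather than arbitrary-precision Python integers.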