BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

Authors: Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that BiLLM achieves state-of-the-art (SOTA) performance for LLMs across multiple LLM families on various evaluation metrics, and first achieves an extremely compact 1.07–1.11 bit-width on average for PTQ binarization.
Researcher Affiliation | Academia | 1 The University of Hong Kong, 2 Beihang University, 3 ETH Zürich.
Pseudocode | Yes | Algorithm 1 illustrates the complete process of BiLLM, and the detailed implementation of BiLLM is shown in Appendix A. Algorithm 2 BiLLM: detailed functions process.
Open Source Code | Yes | Our code is available at https://github.com/Aaronhuang-778/BiLLM.
Open Datasets | Yes | We consider the test of WikiText2 (Merity et al., 2016), PTB (Marcus et al., 1994), as well as a part of the C4 (Raffel et al., 2020) data.
Dataset Splits | No | The paper does not specify training, validation, and test dataset splits (e.g., an 80/10/10 split or specific sample counts for each split).
Hardware Specification | Yes | All the binarization processes and experiments are conducted on a single 80 GB NVIDIA A100.
Software Dependencies | No | We deploy BiLLM within the PyTorch (Paszke et al., 2019) and Huggingface (Wolf et al., 2019) libraries.
Experiment Setup | Yes | We deploy BiLLM on the OPT models (Zhang et al., 2022) under the condition of a block size equal to 128. Algorithm 1 func BinaryLLM(W, X, β, λ). Input: W ∈ ℝ^(n×m) weight matrix, X ∈ ℝ^(r×d) calibration data, β block size, λ Hessian regularizer. Output: B binarized weights. (A minimal interface sketch follows the table.)
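
For orientation, the sketch below mirrors only the Algorithm 1 interface reported above (weight matrix W binarized column-block by column-block with block size β = 128). It is a minimal assumption-laden stand-in, not the authors' method: it applies plain sign-and-scale binarization per block and omits BiLLM's Hessian-guided salient-weight selection, residual approximation, bell-shaped splitting of non-salient weights, and any use of the calibration data X or regularizer λ. The function names binary_llm_block and binary_llm are hypothetical.

    import torch

    def binary_llm_block(W: torch.Tensor) -> torch.Tensor:
        # Plain per-block binarization: alpha * sign(W), where the scalar
        # alpha = mean(|W|) minimizes ||W - alpha * sign(W)||_F. This is a
        # generic baseline, not BiLLM's structural salient/non-salient scheme.
        alpha = W.abs().mean()
        return alpha * torch.sign(W)

    def binary_llm(W: torch.Tensor, beta: int = 128) -> torch.Tensor:
        # Column-blockwise loop matching the reported block size beta = 128.
        # The calibration data X and Hessian regularizer lambda from
        # Algorithm 1 are omitted here (no error compensation is performed).
        B = torch.empty_like(W)
        for start in range(0, W.shape[1], beta):
            end = min(start + beta, W.shape[1])
            B[:, start:end] = binary_llm_block(W[:, start:end])
        return B

Usage would amount to calling binary_llm(layer.weight.data) on each linear layer's weight matrix; the paper's full pipeline additionally uses the calibration set to form Hessian information per block and to compensate quantization error, which this sketch deliberately leaves out.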