BiE: Bi-Exponent Block Floating-Point for Large Language Models Quantization

Authors: Lancheng Zou, Wenqian Zhao, Shuo Yin, Chen Bai, Qi Sun, Bei Yu

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose a novel numerical representation, Bi-Exponent Block Floating-Point (BiE), tackling the drawbacks that current LLM quantization methods face under diverse data characteristics. We analyze that BiE offers data efficiency and quantization-error reduction and beats SOTA baselines. We propose an offline thresholding optimization strategy that enhances the BiE encoding flow with Bayesian Optimization. We implement the BiE hardware design to validate its hardware efficiency. The simulation results demonstrate that the BiE W4A4 quantization configuration obtains 3.51× computation- and 2.8× memory-efficiency improvements compared with FP16. (Section 4, Experiments; Table 2 compares different methods and quantization configurations for OPT models on WikiText2 perplexity, lower is better, and highlights the 4-bit BiE results as comparable with SmoothQuant W8A8.)
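The excerpt above outlines the idea but not the exact encoding. As a rough illustration, the sketch below quantizes a row in 16-element blocks with a shared 5-bit exponent and 4-bit signed mantissas, routing values above a magnitude threshold to a second shared exponent. The block size, bit widths, and the threshold's role follow the Experiment Setup row further down; the exponent-selection rule and the per-block handling of the two groups are assumptions, since the paper's precise BiE encoding is not reproduced in this excerpt.

```python
import numpy as np

BLOCK = 16       # elements per block, sliced along the matrix row
MANT_BITS = 4    # signed mantissa width (W4A4 configuration)
EXP_BITS = 5     # shared-exponent width

def quantize_group(vals, shared_exp):
    """Round a group of values to signed MANT_BITS-bit mantissas under one shared exponent."""
    scale = 2.0 ** shared_exp
    qmax = 2 ** (MANT_BITS - 1) - 1            # e.g. 7 for a 4-bit signed mantissa
    m = np.clip(np.round(vals / scale), -qmax - 1, qmax)
    return m * scale                           # dequantized values

def bie_quantize(row, threshold):
    """Split a row into normal/outlier parts and give each group its own shared exponent (assumed scheme)."""
    out = np.empty_like(row)
    outlier = np.abs(row) > threshold
    for start in range(0, len(row), BLOCK):
        sl = slice(start, start + BLOCK)
        for mask in (outlier[sl], ~outlier[sl]):
            vals = row[sl][mask]
            if vals.size == 0:
                continue
            # Choose the shared exponent so the group's max magnitude fits the mantissa range.
            exp = int(np.ceil(np.log2(np.abs(vals).max() + 1e-30))) - (MANT_BITS - 1)
            exp = int(np.clip(exp, -(2 ** (EXP_BITS - 1)), 2 ** (EXP_BITS - 1) - 1))
            out[sl][mask] = quantize_group(vals, exp)
    return out

row = np.random.randn(64).astype(np.float32)
row[5] = 20.0                                  # inject an outlier
deq = bie_quantize(row, threshold=4.0)
print(float(np.abs(row - deq).max()))          # worst-case quantization error
```

Keeping a second exponent for the outlier group is what lets the normal group use a small exponent and fine-grained steps, which is the error-reduction argument the excerpt makes against single-exponent BFP.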
Researcher Affiliation | Academia | (1) The Chinese University of Hong Kong, China; (2) Zhejiang University, China. Correspondence to: Qi Sun <qisunchn@zju.edu.cn>, Bei Yu <byu@cse.cuhk.edu.hk>.
Pseudocode | No | The paper describes computational processes but does not include any figure, block, or section explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | No | The paper mentions using 'Pytorch (Paszke et al., 2019) and huggingface (Wolf et al., 2019) libraries to quantize the models' but does not provide an explicit statement or link indicating that the source code for the proposed BiE method is publicly available.
Open Datasets | Yes | Models and Datasets: We evaluate BiE with two representative families of LLMs: OPT (Zhang et al., 2022a), ranging from 6.7B to 66B, and LLaMA-2 (Touvron et al., 2023), including 7B, 13B, and 70B. For threshold searching, we use the Pile dataset (Gao et al., 2020) with 128 random samples as the calibration dataset to get the statistical characteristics of activations and weights. Seven zero-shot NLP tasks (multiple choice, commonsense reasoning, language modeling, etc.) are used to evaluate the OPT and LLaMA-2 models: LAMBADA (Paperno et al., 2016), ARC-easy (Clark et al., 2018), PIQA (Bisk et al., 2020), COPA (Roemmele et al., 2011), QNLI (Wang et al., 2018), SST-2 (Socher et al., 2013), and WikiText2 (Merity et al., 2016).
Dataset Splits | Yes | For threshold searching, we use the Pile dataset (Gao et al., 2020) with 128 random samples as the calibration dataset to get the statistical characteristics of activations and weights.
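A minimal sketch of such a calibration pass, assuming Hugging Face APIs; OPT-125m and the public mit-han-lab/pile-val-backup Pile mirror stand in for the models and data actually used (the paper quantizes OPT-6.7B to 66B), and the tracked statistic (per-linear-layer absolute maximum of the inputs) is likewise an assumption, since the excerpt only says statistics of activations and weights are gathered from 128 random Pile samples.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()

# Draw 128 random calibration samples, mirroring the paper's setup.
calib = load_dataset("mit-han-lab/pile-val-backup", split="validation")
calib = calib.shuffle(seed=0).select(range(128))

stats = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Track the running absolute-max of each linear layer's input activations.
        amax = inputs[0].detach().abs().max().item()
        stats[name] = max(stats.get(name, 0.0), amax)
    return hook

handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules()
           if isinstance(m, torch.nn.Linear)]

with torch.no_grad():
    for sample in calib:
        ids = tok(sample["text"], return_tensors="pt",
                  truncation=True, max_length=512).input_ids
        model(ids)

for h in handles:
    h.remove()
```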
Hardware Specification | Yes | We use the PyTorch (Paszke et al., 2019) and Hugging Face (Wolf et al., 2019) libraries to quantize the models on four A100-80G GPUs. ... The BiE PE array is implemented with the Xilinx Vitis High-Level Synthesis (HLS) tool 2020.1, targeting an FPGA: the Xilinx Zynq UltraScale+ ZCU104 Evaluation Board.
Software Dependencies | Yes | We use the PyTorch (Paszke et al., 2019) and Hugging Face (Wolf et al., 2019) libraries to quantize the models on four A100-80G GPUs. ... The BiE PE array is implemented with the Xilinx Vitis High-Level Synthesis (HLS) tool 2020.1, targeting an FPGA: the Xilinx Zynq UltraScale+ ZCU104 Evaluation Board.
Experiment Setup | Yes | The quantization configurations are W4A4 and W3A3 for BiE and BFP, with 4-bit and 3-bit mantissas, respectively; both use a 5-bit shared exponent. We use 16-element blocks for both activations and weights, sliced along the matrix row. Each tensor uses a threshold to distinguish its normal and outlier parts. We quantize all matrix multiplications in the Transformer decoder layer, including all linear layers, bmm in OPT models, and matmul in LLaMA-2 models, and leave the other parts, e.g., Softmax and LayerNorm, in FP16. In our implementation, P_lo and P_hi are set to 75% and 95% (see the threshold-search sketch below).
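This row suggests P_lo and P_hi bound an offline search for each tensor's outlier threshold. A hedged sketch follows, assuming the two percentiles delimit the search window over the calibration activations and reusing the hypothetical bie_quantize helper from the earlier sketch; Optuna's default TPE sampler stands in for the paper's Bayesian Optimization, and the MSE objective is illustrative.

```python
import numpy as np
import optuna  # TPE sampler used here as a stand-in for the paper's Bayesian Optimization

P_LO, P_HI = 75.0, 95.0  # percentile window quoted in the setup above (assumed meaning)

# Stand-in calibration activations; in practice these would come from the
# 128 Pile samples described earlier.
calib = np.random.randn(4096).astype(np.float32)

def objective(trial):
    pct = trial.suggest_float("percentile", P_LO, P_HI)
    threshold = float(np.percentile(np.abs(calib), pct))
    deq = bie_quantize(calib, threshold)        # hypothetical helper from the earlier sketch
    return float(np.mean((calib - deq) ** 2))   # quantization error (MSE)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=32)
print(study.best_params, study.best_value)
```

Because the search runs once, offline, on calibration statistics, the chosen threshold adds no runtime cost to inference, which is consistent with the hardware-efficiency claims in the Research Type row.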