SqueezeLLM: Dense-and-Sparse Quantization

Authors: Sehoon Kim, Coleman Richard Charles Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively test SqueezeLLM on various models on language modeling tasks using the C4 and WikiText2 datasets as well as on the MMLU (Hendrycks et al., 2021) and Vicuna benchmarks (Chiang et al., 2023) (Sec. 5.3). Furthermore, our deployed models on A6000 GPUs also exhibit significant latency gains of up to 2.4× compared to the FP16 baseline, showcasing the effectiveness of our method in terms of both quantization performance and inference efficiency (Sec. 5.4).
Researcher Affiliation | Academia | Sehoon Kim*1, Coleman Hooper*1, Amir Gholami*1,2, Zhen Dong1, Xiuyu Li1, Sheng Shen1, Michael W. Mahoney1,2,3, Kurt Keutzer1 (*equal contribution; 1UC Berkeley, 2ICSI, 3LBNL). Correspondence to: Amir Gholami <amirgh@berkeley.edu>.
Pseudocode | No | The paper describes the methods used but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/SqueezeAILab/SqueezeLLM.
Open Datasets | Yes | We conduct language modeling evaluation using the C4 (Raffel et al., 2020) and WikiText2 (Merity et al., 2016) datasets. We further evaluate the domain-specific knowledge and problem-solving ability using MMLU (Hendrycks et al., 2021) and the instruction-following ability using the methodology in (Chiang et al., 2023). (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | For measuring sensitivity, we use 100 random samples from the Vicuna training set for Vicuna models and C4 training set for the others.
Hardware Specification | Yes | Furthermore, when deployed on an A6000 GPU, our quantized models achieve up to 2.3× speedup compared to the baseline. (...) We measure the latency and peak memory usage for generating 128 and 1024 tokens on an A6000 machine using the Torch CUDA profiler. (...) we used a simple roofline-based performance modeling approach (Kim et al., 2023) to study LLaMA-7B's runtime on an A5000 GPU with different bit precisions (Fig. 2). (...) Table G.12. Matrix-vector kernel runtime (in seconds) for generating 128 tokens, benchmarked on an A100 GPU. (A timing sketch follows the table.)
Software Dependencies | No | The paper mentions 'Torch CUDA profiler' and 'CUDA LUT-based kernels' but does not specify their version numbers or any other software dependencies with explicit versions.
Experiment Setup | Yes | For SqueezeLLM, we adopt channelwise quantization where each output channel is assigned a separate lookup table. We use 2 different sparsity levels: 0% (dense-only) and 0.45% (0.05% sensitive values and 0.4% outlier values, as discussed in Sec. 4.2). For measuring sensitivity, we use 100 random samples from the Vicuna training set for Vicuna models and C4 training set for the others. (A dequantization sketch illustrating this setup follows the table.)
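
To make the Open Datasets row concrete, below is a minimal sketch of loading the evaluation corpora named above with the Hugging Face datasets library. The hub identifiers (wikitext, allenai/c4, cais/mmlu) are the commonly used ones and are assumptions on our part; the paper does not state how the data was obtained.

```python
# Minimal sketch: loading the evaluation corpora named in the paper via the
# Hugging Face `datasets` library. Dataset IDs are the usual hub names and
# are an assumption, not taken from the paper itself.
from datasets import load_dataset

# WikiText-2 test split, commonly used for perplexity evaluation.
wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

# C4 (English) is large, so stream it instead of downloading it in full.
c4 = load_dataset("allenai/c4", "en", split="validation", streaming=True)

# MMLU test questions (Hendrycks et al., 2021).
mmlu = load_dataset("cais/mmlu", "all", split="test")

print(len(wikitext2), next(iter(c4))["text"][:80], len(mmlu))
```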
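The Hardware Specification row describes measuring latency and peak memory for generating 128 and 1024 tokens with the Torch CUDA profiler. Below is a minimal sketch of one way to take such measurements using CUDA events and torch.cuda memory statistics; it assumes a Hugging Face causal LM already loaded on the GPU and is not the paper's actual benchmarking harness.

```python
# Minimal timing/peak-memory sketch, in the spirit of the paper's A6000
# measurements. `model` and `tokenizer` are assumed to be a Hugging Face
# causal LM and its tokenizer, with the model already on the GPU.
import torch

def benchmark_generation(model, tokenizer, prompt, max_new_tokens=128):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    end.record()
    torch.cuda.synchronize()

    latency_s = start.elapsed_time(end) / 1000.0            # ms -> s
    peak_mem_gb = torch.cuda.max_memory_allocated() / 1e9   # bytes -> GB
    return latency_s, peak_mem_gb
```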
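The Experiment Setup row summarizes SqueezeLLM's channelwise, lookup-table-based quantization with a small sparse component. The sketch below illustrates that dense-and-sparse decomposition at dequantization time: each output channel indexes its own LUT of centroids, and a small fraction of sensitive/outlier weights is kept in full precision as a sparse matrix. Tensor shapes, helper names, the FP32 toy dtypes, and the COO sparse format are illustrative assumptions, not the repository's actual CUDA kernel interface (which operates in FP16 with CSR storage).

```python
# Illustrative sketch of dense-and-sparse, channelwise LUT dequantization.
import torch

def dequantize_dense(indices, luts):
    """indices: (out, in) integer LUT indices; luts: (out, 2**bits) per-channel centroids."""
    # Each output channel looks its weights up in its own table (channelwise LUTs).
    return torch.gather(luts, 1, indices.long())

def dense_sparse_matvec(indices, luts, sparse_outliers, x):
    """Compute W @ x with W = dequantized dense part + full-precision sparse outliers."""
    w_dense = dequantize_dense(indices, luts)                         # (out, in)
    y_dense = w_dense @ x
    y_sparse = torch.sparse.mm(sparse_outliers, x.unsqueeze(1)).squeeze(1)
    return y_dense + y_sparse

# Toy usage: 4-bit LUTs, an 8x32 weight matrix, ~0.5% of entries kept as sparse outliers.
out_dim, in_dim, bits = 8, 32, 4
luts = torch.randn(out_dim, 2 ** bits)
indices = torch.randint(0, 2 ** bits, (out_dim, in_dim))
mask = torch.rand(out_dim, in_dim) < 0.005
sparse_outliers = (torch.randn(out_dim, in_dim) * mask).to_sparse()
x = torch.randn(in_dim)
y = dense_sparse_matvec(indices, luts, sparse_outliers, x)
print(y.shape)  # torch.Size([8])
```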