Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge

Authors: Xuan Shen, Peiyan Dong, Lei Lu, Zhenglun Kong, Zhengang Li, Ming Lin, Chao Wu, Yanzhi Wang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that Agile-Quant achieves simultaneous quantization of model weights and activations while maintaining task performance comparable to existing weight-only quantization methods. Moreover, in the 8- and 4-bit scenarios, Agile-Quant achieves an on-device speedup of up to 2.55x compared to its FP16 counterparts across multiple edge devices, marking a pioneering advancement in this domain.
Researcher Affiliation | Collaboration | Northeastern University and Oracle
Pseudocode | No | The paper includes figures illustrating methods but no formal pseudocode or algorithm blocks with structured steps.
Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code for the described methodology or a link to a code repository.
Open Datasets | Yes | We implement the different scales of LLaMA, OPT, and BLOOM models in our experiments on the Wikitext-2 dataset (Merity et al. 2016) and C4 (Raffel et al. 2020) dataset. (A loading sketch for these corpora appears below the table.)
Dataset Splits | No | The paper mentions using standard datasets (Wikitext-2, C4) but does not provide explicit details on train/validation/test splits (percentages, sample counts, or explicit references to standard splits for their specific experimental setup).
Hardware Specification | Yes | We test the actual inference implementation on various edge devices, including the Realme GT Android Phone with Snapdragon 870 SoC and Raspberry Pi 4B with quad-core CPU and 8GB RAM.
Software Dependencies | Yes | Our inference engine for Arm processors is modified based on Arm Compute Library v22.05.
Experiment Setup | Yes | We implement the activation quantization based on the weight-only quantization work GPTQ (Frantar et al. 2022), which achieves state-of-the-art performance with 4-bit weight-only quantization for LLMs. We mainly use 4-bit and 8-bit integers in our activation quantization. We use Log2 quantization for the softmax activation quantization and our TRIP quantization for the other activations. We regulate the token pruning ratio to optimize the task performance of LLMs. We apply the prune ratio progressively, starting from the shallow layers, so that the token pruning can optimize the activation quantization for the deeper layers. Assume the model has n layers L = (l_1, l_2, ..., l_n) and the pruning operation is added to the layers L_p = (l_{p_1}, l_{p_2}, ..., l_{p_m}). We set the prune ratio β at the last layer l_n and compute the progressive ratio γ for the layers l_i ∈ L_p as γ = 1 − (1 − β)^(1/m) (Eq. 9). We accumulate the token sparsity s from the numbers of pruned tokens r_i during inference as s = Σ_{i=1}^{m} r_i (Eq. 10).
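
To make the quoted setup easier to follow, the sketch below implements Eq. (9) and Eq. (10) as reconstructed above, plus a standard Log2 (power-of-two) quantizer for softmax activations. This is a minimal PyTorch sketch under those assumptions, not the authors' released implementation; the paper's TRIP quantizer for the remaining activations is not reproduced here, and the function names and example values are illustrative.

```python
# Illustrative sketch only (not the authors' code). It reproduces Eq. (9) and
# Eq. (10) as quoted above and shows a standard Log2 (power-of-two) quantizer
# for softmax activations; the paper's TRIP quantizer is not shown here.
import torch


def progressive_prune_ratio(beta: float, m: int) -> float:
    """Eq. (9): per-layer ratio gamma so that m successive prunings
    remove a total fraction beta of the tokens by the last layer."""
    return 1.0 - (1.0 - beta) ** (1.0 / m)


def accumulated_sparsity(pruned_per_layer: list[int]) -> int:
    """Eq. (10): token sparsity s accumulated over the pruned layers,
    where each entry is the number of tokens pruned at that layer."""
    return sum(pruned_per_layer)


def log2_quantize_softmax(p: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Standard Log2 quantization for softmax outputs in [0, 1]:
    round -log2(p) to an integer code, then dequantize to a power of two."""
    eps = torch.finfo(p.dtype).tiny
    code = torch.clamp(torch.round(-torch.log2(p.clamp(min=eps))),
                       0, 2 ** n_bits - 1)
    return 2.0 ** (-code)


if __name__ == "__main__":
    # With beta = 0.5 spread over m = 4 pruned layers, each layer prunes ~15.9%.
    gamma = progressive_prune_ratio(beta=0.5, m=4)
    print(f"gamma = {gamma:.4f}")

    # Example (hypothetical) pruned-token counts for the m pruned layers.
    print(accumulated_sparsity([32, 27, 23, 19]))

    attn = torch.softmax(torch.randn(2, 8), dim=-1)
    print(log2_quantize_softmax(attn, n_bits=4))
```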
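
Relating to the Open Datasets row above, the corpora named there are publicly available; below is a minimal sketch of loading them with the Hugging Face datasets library. The dataset identifiers, config names, and splits shown are common-usage assumptions for illustration and are not specified by the paper.

```python
# Minimal sketch (assumption): loading the evaluation corpora named in the
# paper with the Hugging Face `datasets` library; the paper does not state
# how the data were obtained or split.
from datasets import load_dataset

# WikiText-2 (Merity et al. 2016); "wikitext-2-raw-v1" is the commonly used config.
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

# C4 (Raffel et al. 2020); streamed here because the full corpus is very large.
c4 = load_dataset("allenai/c4", "en", split="validation", streaming=True)

print(wikitext[0]["text"][:200])
print(next(iter(c4))["text"][:200])
```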