Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
Authors: Xuan Shen, Peiyan Dong, Lei Lu, Zhenglun Kong, Zhengang Li, Ming Lin, Chao Wu, Yanzhi Wang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that Agile-Quant achieves simultaneous quantization of model weights and activations while maintaining task performance comparable to existing weight-only quantization methods. Moreover, in the 8- and 4-bit scenario, Agile-Quant achieves an on-device speedup of up to 2.55x compared to its FP16 counterparts across multiple edge devices, marking a pioneering advancement in this domain. |
| Researcher Affiliation | Collaboration | Northeastern University, Oracle |
| Pseudocode | No | The paper includes figures illustrating methods but no formal pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | We implement the different scales of LLaMA, OPT, and BLOOM models in our experiments on the Wikitext-2 (Merity et al. 2016) and C4 (Raffel et al. 2020) datasets. |
| Dataset Splits | No | The paper mentions using standard datasets (Wikitext-2, C4) but does not provide explicit details on train/validation/test splits (percentages, sample counts, or explicit references to standard splits for their specific experimental setup). |
| Hardware Specification | Yes | We test the actual inference implementation on various edge devices, including the Realme GT Android Phone with Snapdragon 870 SoC and Raspberry Pi 4B with quad-core CPU and 8GB RAM. |
| Software Dependencies | Yes | Our inference engine for Arm processors is modified based on Arm Compute Library v22.05. |
| Experiment Setup | Yes | We implement the activation quantization based on the weight-only quantization work GPTQ (Frantar et al. 2022), which achieves state-of-the-art performance with 4-bit weight-only quantization for LLMs. We mainly use 4-bit and 8-bit integers in our activation quantization. We use Log2 quantization for the softmax activation quantization and our TRIP quantization for the other activations. We regulate the token pruning ratio to optimize the task performance of LLMs. We apply the prune ratio progressively, starting from the shallow layers, so that the token pruning can optimize the activation quantization for the deeper layers. Assume the model has n layers L = (l1, l2, ..., ln) and the pruning operation is added to the layers Lp = (lp1, lp2, ..., lpm). We set the prune ratio β at the last layer ln and compute the progressive ratio γ for the layers li ∈ Lp as γ = 1 − (1 − β)^(1/m) (Eq. 9). We accumulate the token sparsity s from the pruned tokens r_i during inference as s = Σ_{i=1}^{m} r_i (Eq. 10). |
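To make the Experiment Setup row concrete, the sketch below implements the progressive prune-ratio schedule of Eq. (9), an accumulated-sparsity sum in the spirit of Eq. (10), and a generic power-of-two (Log2) quantizer for softmax outputs. The function names, the example values of β and m, the reading of r_i as per-layer pruned ratios, and the specific Log2 rounding scheme are assumptions made for illustration; this is not the authors' released code.

```python
import math

# Minimal sketch, assuming a standard reading of Eqs. (9)-(10) and of
# power-of-two quantization; names and example values are hypothetical.

def progressive_prune_ratio(beta: float, m: int) -> float:
    """Per-layer prune ratio gamma such that pruning at each of the m
    pruning layers compounds to the target ratio beta at the last layer:
    (1 - gamma)^m = 1 - beta  =>  gamma = 1 - (1 - beta)^(1/m)  (Eq. 9)."""
    return 1.0 - (1.0 - beta) ** (1.0 / m)


def accumulated_token_sparsity(pruned_ratios) -> float:
    """Accumulated token sparsity s, read here as the sum of per-layer
    pruned ratios r_i over the pruning layers (Eq. 10)."""
    return sum(pruned_ratios)


def log2_quantize_softmax(p: float, n_bits: int = 4) -> int:
    """Log2 quantization of a softmax probability p in (0, 1]:
    store the rounded, clipped negative base-2 exponent."""
    q_max = (1 << n_bits) - 1
    if p <= 0.0:
        return q_max                      # map non-positive input to the smallest value
    return min(q_max, max(0, round(-math.log2(p))))


def log2_dequantize(q: int) -> float:
    """Reconstruct the power-of-two value 2^-q."""
    return 2.0 ** (-q)


if __name__ == "__main__":
    beta, m = 0.5, 4                      # example target ratio / number of pruning layers
    gamma = progressive_prune_ratio(beta, m)
    # The surviving fraction after m pruning rounds recovers the target:
    # (1 - gamma)^m == 1 - beta.
    print(f"gamma = {gamma:.4f}, check (1-gamma)^m = {(1 - gamma) ** m:.4f}")
    print(f"s = {accumulated_token_sparsity([gamma] * m):.4f}")
    q = log2_quantize_softmax(0.3)
    print(f"log2-quant(0.3) -> {q} -> {log2_dequantize(q):.4f}")
```

The design choice behind the compounding form of γ is that applying the same small ratio at every pruning layer, rather than the full β at once, removes tokens gradually so that deeper layers see progressively shorter sequences, which is what makes the activation quantization of those layers cheaper.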