QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Authors: Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate both the accuracy and efficiency of Quest. Since Quest dynamically decides the criticality of the tokens, Quest achieves better accuracy for a given degree of KV cache sparsity than baselines on the PG19 dataset (Rae et al., 2019), passkey retrieval task (Peng et al., 2023), and LongBench (Bai et al., 2023) with 256 to 4K token budgets. For 32K context, Quest achieves 7.03× self-attention latency reduction compared to FlashInfer (Ye et al., 2024). Our end-to-end framework demonstrates that Quest can have 2.23× inference speedup compared to FlashInfer (Ye et al., 2024) with 4-bit weight quantization.
Researcher Affiliation | Collaboration | ¹Shanghai Jiao Tong University, ²MIT, ³University of Washington, ⁴NVIDIA. Correspondence to: Song Han <songhan@mit.edu>, Baris Kasikci <baris@cs.washington.edu>.
Pseudocode | Yes | Algorithm 1 Token Criticality Estimation (a minimal sketch of this estimation step appears after the table).
Open Source Code | Yes | Code is available at https://github.com/mit-han-lab/quest.
Open Datasets | Yes | We evaluate Quest on the language modeling dataset PG19 (Rae et al., 2019), passkey retrieval task (Peng et al., 2023), and six datasets in LongBench (Bai et al., 2023): NarrativeQA (Kočiský et al., 2018), HotpotQA (Yang et al., 2018), Qasper (Dasigi et al., 2021), TriviaQA (Joshi et al., 2017), GovReport (Huang et al., 2021), MultifieldQA (Bai et al., 2023).
Dataset Splits | No | The paper describes evaluation on PG19, passkey retrieval, and LongBench, and how inputs are processed (prefill followed by simulated token-by-token decoding), but it does not specify explicit train/validation/test splits (e.g., percentages, sample counts, or references to predefined splits) for any of the datasets used.
Hardware Specification | Yes | Tested with FP16 FlashInfer implementation on an RTX4090
Software Dependencies | Yes | We evaluate Quest's kernel-level efficiency under the configuration of Llama2-7B on an RTX4090 with CUDA 12.2 in Sec 4.3.1.
Experiment Setup | Yes | We choose two widely used long-context models for our evaluation: LongChat-v1.5-7b-32k (Li et al., 2023) and Yarn-Llama-2-7b-128k (Peng et al., 2023). We compare our method against the KV cache eviction algorithms H2O (Zhang et al., 2023b), TOVA (Oren et al., 2024), and StreamingLLM (Xiao et al., 2023). Note that we do not apply Quest or any of the baseline algorithms to the first two layers of the model, as our analysis in Sec 3.4 indicates a low sparsity ratio for these layers.
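
For readers who want a concrete picture of Algorithm 1 (Token Criticality Estimation) referenced in the Pseudocode row, the following is a minimal PyTorch sketch of the idea: keep per-channel min/max key values as metadata for each fixed-size KV-cache page, upper-bound the query-key dot product per page, and keep only the top-scoring pages within the token budget. The function names, tensor shapes, and page size are illustrative assumptions, not the authors' released implementation (the linked repository contains the actual CUDA kernels).

```python
# A minimal sketch (not the authors' implementation) of query-aware page
# selection: per-page min/max key metadata gives an upper bound on the
# attention score any token in that page could receive from the current query.
import torch


def estimate_page_criticality(query: torch.Tensor,
                              key_cache: torch.Tensor,
                              page_size: int = 16) -> torch.Tensor:
    """Upper-bound attention score per KV-cache page for a single head.

    query:     [head_dim]           current decoding query
    key_cache: [seq_len, head_dim]  cached keys for this head
    returns:   [num_pages]          criticality score per page
    """
    seq_len, head_dim = key_cache.shape
    num_pages = (seq_len + page_size - 1) // page_size
    pad = num_pages * page_size - seq_len
    if pad:
        # Pad the last page by repeating its final key; since that key is
        # already in the page, the per-page min/max are unchanged.
        key_cache = torch.cat(
            [key_cache, key_cache[-1:].expand(pad, head_dim)], dim=0)
    pages = key_cache.reshape(num_pages, page_size, head_dim)

    k_max = pages.max(dim=1).values  # [num_pages, head_dim] page metadata
    k_min = pages.min(dim=1).values  # [num_pages, head_dim]

    # For each channel, the larger of q_i * max_i and q_i * min_i bounds the
    # contribution of any key in the page; summing gives a per-page bound.
    upper = torch.maximum(query * k_max, query * k_min)
    return upper.sum(dim=-1)


def select_topk_pages(scores: torch.Tensor, token_budget: int,
                      page_size: int = 16) -> torch.Tensor:
    """Indices of pages kept under a given token budget."""
    k = min(max(token_budget // page_size, 1), scores.numel())
    return torch.topk(scores, k).indices


# Toy usage: one head, 80 cached tokens, roughly a 32-token budget.
q = torch.randn(128)
K = torch.randn(80, 128)
kept_pages = select_topk_pages(estimate_page_criticality(q, K), token_budget=32)
```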
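
The Experiment Setup row also notes that neither Quest nor the baselines are applied to the first two layers. Below is a minimal sketch of that per-layer policy, assuming a hypothetical dict-based configuration; the layer count, budget value, and key names are placeholders, not the repository's actual API.

```python
# Hypothetical per-layer policy: keep the first two layers dense and apply a
# fixed token budget of query-aware sparse attention to the remaining layers.
NUM_LAYERS = 32      # e.g., a Llama2-7B-scale model
DENSE_LAYERS = 2     # Sec 3.4: the first two layers show a low sparsity ratio
TOKEN_BUDGET = 1024  # the paper evaluates budgets from 256 to 4K tokens

layer_config = {
    idx: {
        "sparse": idx >= DENSE_LAYERS,
        "token_budget": TOKEN_BUDGET if idx >= DENSE_LAYERS else None,
    }
    for idx in range(NUM_LAYERS)
}

assert not layer_config[0]["sparse"]  # dense full attention
assert layer_config[2]["sparse"]      # sparse, 1024-token budget
```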