ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction

Authors: Renze Chen, Zhuofeng Wang, Beiquan Cao, Tong Wu, Size Zheng, Xiuhong Li, Xuechao Wei, Shengen Yan, Meng Li, Yun Liang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiment results show that ARKVALE performs well on various long-context tasks with negligible accuracy loss under a 2k–4k cache budget and can improve decoding latency by up to 2.2× (1.7× on average) and batching throughput by up to 4.6× (3.5× on average).
Researcher Affiliation | Collaboration | Renze Chen (Peking University, crz@pku.edu.cn); Zhuofeng Wang (Peking University, 2200012827@stu.pku.edu.cn); Beiquan Cao (Peking University, 2200012988@stu.pku.edu.cn); Tong Wu (Peking University, 2200013212@stu.pku.edu.cn); Size Zheng (Peking University, zhengsz@pku.edu.cn); Xiuhong Li (Peking University, lixiuhong@pku.edu.cn); Xuechao Wei (Peking University, xuechao.wei@pku.edu.cn); Shengen Yan (Infinigence-AI, yanshengen@gmail.com); Meng Li (Peking University, meng.li@pku.edu.cn); Yun Liang (Peking University, ericlyun@pku.edu.cn)
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is now available at https://github.com/pku-liang/ArkVale.
Open Datasets | Yes | We apply our method to LongChat-7b-v1.5-32k [1] and use 6 datasets from LongBench [9] for benchmarking: HotpotQA [59], NarrativeQA [35], Qasper [20], GovReport [28], TriviaQA [30], and PassageRetrieval [9], along with the passkey-retrieval tasks. (A dataset-loading sketch follows the table.)
Dataset Splits | No | The paper mentions using datasets for benchmarking and simulation, but it does not explicitly provide details about specific training/validation/test splits, their percentages, or how they were derived for the experiments.
Hardware Specification | Yes | Our experiment platform comprises an Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz and an NVIDIA A100 80GB PCIe GPU.
Software Dependencies | Yes | The software stack includes CUDA version 12.3, PyTorch [41, 8] version 2.3.0, and Hugging Face Transformers [57] version 4.40.0. We implement ARKVALE on top of Hugging Face Transformers, with CUTLASS [54], FlashInfer [60], and RAFT [44] for certain kernels. (An environment-check sketch follows the table.)
Experiment Setup | Yes | We configure four cache budget settings: 4096, 2048, 1024, and 512. ... with settings of batch-size=4, page-size=32, and KV cache budgets of 512, 1024, 2048, and 4096. ... For page-size p and cache-capacity c (in tokens), we set k = min(C, c/2)/p, where C is a hyper-parameter (default C = 40 × 32 = 1280). (A worked example of this formula follows the table.)
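
The Open Datasets row lists six LongBench tasks. As a minimal sketch of how they can be fetched, the snippet below assumes the community-hosted THUDM/LongBench repository on the Hugging Face hub and its per-task configuration names; neither is specified in the paper, so treat both as assumptions.

```python
# Hypothetical loading of the six LongBench tasks named in the paper.
# The "THUDM/LongBench" repo and the config names below are assumptions,
# not something the paper specifies.
from datasets import load_dataset

TASKS = [
    "hotpotqa",              # HotpotQA
    "narrativeqa",           # NarrativeQA
    "qasper",                # Qasper
    "gov_report",            # GovReport
    "triviaqa",              # TriviaQA
    "passage_retrieval_en",  # PassageRetrieval
]

for task in TASKS:
    # trust_remote_code is needed if the repo ships a loading script
    ds = load_dataset("THUDM/LongBench", task, split="test", trust_remote_code=True)
    print(f"{task}: {len(ds)} examples")
```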
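
Because the reported numbers are tied to a specific software stack (CUDA 12.3, PyTorch 2.3.0, Transformers 4.40.0), a quick version check is useful before attempting reproduction. This is a plain introspection sketch; only the expected version strings come from the paper.

```python
# Check the local environment against the stack reported in the paper:
# CUDA 12.3, PyTorch 2.3.0, Hugging Face Transformers 4.40.0.
import torch
import transformers

expected = {"torch": "2.3.0", "transformers": "4.40.0", "cuda": "12.3"}
actual = {
    "torch": torch.__version__,                # e.g. "2.3.0+cu121"
    "transformers": transformers.__version__,
    "cuda": torch.version.cuda,                # CUDA version PyTorch was built with
}

for name, want in expected.items():
    have = actual[name]
    ok = have is not None and have.startswith(want)
    print(f"{name}: expected {want}, found {have} [{'OK' if ok else 'MISMATCH'}]")
```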
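
The Experiment Setup row gives the page-recall budget as k = min(C, c/2)/p. The sketch below simply evaluates that formula for the paper's four cache budgets with the default p = 32 and C = 1280; the function name is illustrative, not from the paper.

```python
# Worked example of k = min(C, c/2) / p with the paper's defaults:
# page size p = 32 tokens, hyper-parameter C = 40 * 32 = 1280 tokens.
def num_topk_pages(cache_budget: int, page_size: int = 32, C: int = 40 * 32) -> int:
    # Illustrative name; the paper only gives the formula, not a function.
    return min(C, cache_budget // 2) // page_size

for budget in (512, 1024, 2048, 4096):
    print(f"cache budget = {budget:4d} tokens -> k = {num_topk_pages(budget):2d} pages")
# cache budget =  512 tokens -> k =  8 pages
# cache budget = 1024 tokens -> k = 16 pages
# cache budget = 2048 tokens -> k = 32 pages
# cache budget = 4096 tokens -> k = 40 pages  (capped by C = 1280)
```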