CaM: Cache Merging for Memory-efficient LLMs Inference

Authors: Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, Rongrong Ji

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments utilizing LLaMA, OPT, and GPT-NeoX across various benchmarks corroborate CaM's proficiency in bolstering the performance of memory-efficient LLMs. In this section, we quantitatively examine the effectiveness of the proposed CaM method for enhancing the performance of memory-efficient LLMs.
Researcher Affiliation | Academia | (1) Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University; (2) Peng Cheng Laboratory; (3) University of Texas at Austin; (4) University of Oxford; (5) Eindhoven University of Technology; (6) Institute of Artificial Intelligence, Xiamen University.
Pseudocode | Yes | Algorithm 1: Cache Merging at the t-th Generation Step.
Open Source Code | Yes | Code is released at https://github.com/zyxxmu/cam.
Open Datasets | Yes | Tasks. We conduct experiments on representative tasks for LLM evaluation, including question answering, text summarization, and language modeling. For question answering, we test six tasks using the lm-eval-harness framework (Gao et al., 2021): COPA (Roemmele et al., 2011), MathQA (Amini et al., 2019), OpenBookQA (Mihaylov et al., 2018), PiQA (Bisk et al., 2020), RTE (Wang et al., 2018), and Winogrande (Sakaguchi et al., 2021). For text summarization, we use the HELM framework to evaluate the XSUM (Narayan et al., 2018) and CNN/Daily Mail (Nallapati et al., 2016) tasks. For language modeling, we evaluate the perplexity of LLMs on the WikiText-2 (Merity et al., 2016) and PG-19 (Rae et al., 2019) datasets. (A hedged lm-eval-harness evaluation sketch appears after this table.)
Dataset Splits | No | The paper describes using various datasets for evaluation (e.g., COPA, MathQA, OpenBookQA, PiQA, RTE, Winogrande, XSUM, CNN/Daily Mail, WikiText-2, PG-19) and frameworks such as lm-eval-harness and HELM, but it does not specify explicit training, validation, or test splits in terms of percentages, sample counts, or splitting methodology.
Hardware Specification | Yes | All experiments were conducted on NVIDIA Tesla A800 GPUs, utilizing five distinct random seeds, from which we report both the mean and variance of the results.
Software Dependencies | No | The paper mentions using frameworks such as lm-eval-harness and HELM for evaluation, and models like LLaMA, OPT, and GPT-NeoX, but it does not specify version numbers for any software, libraries, or programming languages used in the experiments.
Experiment Setup | Yes | Table 1: Performance comparison between different methods with or without CaM. Experiments are conducted with zero-shot evaluation under a 20% KV cache budget. Table 3: Accuracy comparison of setting different clamp intervals when sampling the merging mask. Experiments are conducted with a 20% KV cache budget on LLaMA-7B, utilizing StreamingLLM as the baseline of CaM. The term $A = \sum_{k=1}^{t} A_k$ denotes the cumulative attention score across previous generation steps, which supersedes the conventional individual attention at a single step and has been demonstrated to be more robust in predicting future attention, as detailed in (Zhang et al., 2023b). The process of cache merging is subsequently realized through the following operation: $V_k = V_k + \frac{M \odot V_i}{m}$, for $k = j, \ldots, j+m$ (Eq. 15).
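To make the merging operation of Eq. (15) concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: the tensor layout, the function name merge_evicted_value, the toy usage values, and the simplified mask-sampling rule are all illustrative assumptions; Algorithm 1 and the released code at https://github.com/zyxxmu/cam give the exact procedure.

```python
import torch

def merge_evicted_value(value_cache, cum_attn, i, j, m):
    """Sketch of Eq. (15): instead of discarding the value state of an evicted
    token i, merge it into the m retained tokens starting at position j.

    value_cache: (seq_len, head_dim) value states of one attention head
    cum_attn:    (seq_len,) cumulative attention scores A = sum_{k=1}^{t} A_k
    """
    # Sample a binary merging mask M; here its keep-probability is simply the
    # (clamped) cumulative attention of the evicted token, a deliberate
    # simplification of the clamp-interval sampling described in the paper.
    prob = torch.clamp(cum_attn[i], 0.0, 1.0)
    mask = (torch.rand(m, 1) < prob).float()      # (m, 1), broadcasts over head_dim

    # Eq. (15): V_k <- V_k + (M * V_i) / m for the retained positions k.
    value_cache[j:j + m] += mask * value_cache[i] / m
    return value_cache

# Toy usage: a 16-token cache with 64-dim heads; merge token 3 into tokens 8..11.
cache = torch.randn(16, 64)
attn = torch.rand(16)
cache = merge_evicted_value(cache, attn, i=3, j=8, m=4)
```

The design choice this sketch illustrates is that an evicted token's value state is redistributed over the retained cache rather than dropped, so its contribution to future attention outputs is approximately preserved under a reduced KV cache budget.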
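For the question-answering benchmarks listed under Open Datasets, a zero-shot lm-eval-harness run might look roughly like the sketch below. This is an assumption-laden illustration rather than the authors' evaluation script: the checkpoint name, batch size, task spellings, and the v0.4+ Python API are not taken from the paper.

```python
# Zero-shot evaluation sketch with lm-eval-harness (assumes the v0.4+ Python API).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # HuggingFace causal-LM backend
    model_args="pretrained=huggyllama/llama-7b",   # assumed LLaMA-7B checkpoint
    tasks=["copa", "mathqa", "openbookqa", "piqa", "rte", "winogrande"],
    num_fewshot=0,                                 # zero-shot, as in Table 1
    batch_size=8,
)

# Print per-task metrics (accuracy and related scores).
for task, metrics in results["results"].items():
    print(task, metrics)
```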