Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget

Authors: Zihao Wang, Bin Cui, Shaoduo Gan

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we integrate SQUEEZEATTENTION into 7 popular LLM models ranging from 7B to 70B, i.e., Llama2-7B, Mistral-7B, Falcon-7B, OPT-6.7B, GPT-NeoX-20B, Mixtral-8x7B, and Llama2-70B, combining with 3 representative sequence-wise KV-cache compression algorithms, i.e., Heavy-Hitter Oracle (H2O), Sliding Window Attention and StreamingLLM. The results show that SQUEEZEATTENTION can achieve better model performance with even lower cache budgets than all three algorithms under a wide range of models and tasks, which leads to approximately 30% to 70% memory savings and up to 2.2x throughput improvements for inference.
Researcher Affiliation | Collaboration | Zihao Wang, Bin Cui, Shaoduo Gan — School of CS & Key Lab of High Confidence Software Technologies (MOE), Peking University; Geoming AI
Pseudocode | Yes | Algorithm 1 SQUEEZEATTENTION
Open Source Code | Yes | The code is available at https://github.com/hetailang/SqueezeAttention.
Open Datasets | Yes | We conduct experiments on 5 datasets: CNN/Daily Mail, XSUM, TriviaQA, SAMSum, and NarrativeQA. TriviaQA, SAMSum, and NarrativeQA originate from LongBench (Bai et al., 2023), where the data length typically exceeds 8k. CNN/Daily Mail and XSUM have an average length of about 2k. Detailed information about the datasets can be found in Table 1.
Dataset Splits | No | The paper lists the datasets used (CNN/Daily Mail, XSUM, TriviaQA, SAMSum, NarrativeQA) and cites LongBench as the origin for some, but does not explicitly provide training/test/validation splits, percentages, or specific methodologies for partitioning the data.
Hardware Specification | Yes | Hardware. We conduct all the experiments on the AWS platform (p4d.24xlarge) with 8 Nvidia A100-40GB GPUs, interconnected by NVLinks (600 GB/s GPU peer-to-peer bandwidth).
Software Dependencies | No | The paper does not explicitly list specific software dependencies (e.g., Python, PyTorch, CUDA) with version numbers.
Experiment Setup | Yes | SQUEEZEATTENTION involves a hyperparameter p to control the percentage of initial budgets that can be removed from the unimportant layers. The smaller p is, the more budget will be reassigned. In experiments, we found 0.3-0.4 to be a reasonable choice in most cases. To precisely understand the impact of p, we have conducted extra experiments to demonstrate how model accuracy changes with the value of p; please refer to A.2 for more details. We choose the compression hyperparameters for each algorithm such that they all achieve their best model accuracy. The results show that our algorithm can clearly increase throughput compared with SOTA algorithms that compress the KV-cache only along the sequence dimension.
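To make the role of p concrete, the sketch below illustrates one plausible reading of the layer-wise budget reassignment: every layer starts from an equal KV-cache budget, layers deemed unimportant keep only a p fraction of theirs, and the freed budget is redistributed to the remaining layers. The importance scores, the bottom-third cutoff, and all function names are illustrative assumptions, not the authors' implementation.

```python
def reassign_budgets(importance, total_budget, p=0.35):
    """Hypothetical sketch of layer-wise KV-cache budget reassignment.

    importance   -- one score per layer (higher = more important); how these
                    scores are computed is an assumption, not shown here.
    total_budget -- total KV-cache budget shared across all layers.
    p            -- fraction of its initial budget an unimportant layer keeps
                    (smaller p => more budget is reassigned).
    """
    n = len(importance)
    init = total_budget / n  # equal initial budget per layer

    # Assumption: treat the bottom third of layers by importance as unimportant.
    order = sorted(range(n), key=lambda i: importance[i])
    unimportant = set(order[: n // 3])

    # Budget freed from unimportant layers, shared by the important ones.
    freed = init * (1 - p) * len(unimportant)
    bonus = freed / (n - len(unimportant))

    return [init * p if i in unimportant else init + bonus for i in range(n)]
```

Note that the total budget is conserved; only its distribution across layers changes, which is why the paper can report better accuracy at equal or lower overall cache size.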