Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget
Authors: Zihao Wang, Bin Cui, Shaoduo Gan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we integrate SQUEEZEATTENTION into 7 popular LLM models ranging from 7B to 70B, i.e., Llama2-7B, Mistral-7B, Falcon-7B, OPT-6.7B, GPT-NeoX-20B, Mixtral-8x7B, and Llama2-70B, combining with 3 representative sequence-wise KV-cache compression algorithms, i.e., Heavy-Hitter Oracle (H2O), Sliding Window Attention, and StreamingLLM. The results show that SQUEEZEATTENTION can achieve better model performance with even lower cache budgets than all three algorithms under a wide range of models and tasks, which leads to approximately 30% to 70% memory savings and up to 2.2x throughput improvements for inference. |
| Researcher Affiliation | Collaboration | Zihao Wang, Bin Cui, Shaoduo Gan. School of CS & Key Lab of High Confidence Software Technologies (MOE), Peking University; Geoming AI. EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 SQUEEZEATTENTION |
| Open Source Code | Yes | The code is available at https://github.com/hetailang/SqueezeAttention. |
| Open Datasets | Yes | We conduct experiments on 5 datasets: CNN/Daily Mail, XSUM, TriviaQA, SAMSum, and NarrativeQA. TriviaQA, SAMSum, and NarrativeQA originate from LongBench (Bai et al., 2023), where the data length typically exceeds 8k. CNN/Daily Mail and XSUM have an average length of about 2k. Detailed information about the datasets can be found in Table 1. |
| Dataset Splits | No | The paper lists the datasets used (CNN/Daily Mail, XSUM, TriviaQA, SAMSum, NarrativeQA) and cites LongBench as the origin of some, but it does not explicitly provide training/test/validation splits, percentages, or a methodology for partitioning the data. |
| Hardware Specification | Yes | Hardwares. We conduct all the experiments on the AWS platform (p4d.24xlarge) with 8 Nvidia A100-40GB GPUs, interconnected by the NVLinks (600 GB/s GPU peer-to-peer bandwidth). |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies (e.g., Python, PyTorch, CUDA) with version numbers. |
| Experiment Setup | Yes | SQUEEZEATTENTION involves a hyperparameter p to control the percentage of initial budgets that could be removed from the unimportant layers. The smaller p is, the more budget will be reassigned. In experiments, we found 0.3-0.4 to be a reasonable range in most cases. To precisely understand the impact of p, we have conducted extra experiments to demonstrate how the model accuracy changes with the value of p; please refer to A.2 for more details. We choose the compression hyperparameters for each algorithm such that they all achieve their best model accuracy. The results show that our algorithm can clearly increase throughput compared with those SOTA algorithms that only compress the KV-cache along the sequence dimension. |
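The layer-wise budget reallocation the setup describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `reallocate_budgets`, the median split of layers, and the even redistribution of freed budget are all assumptions (the paper groups layers by importance and reassigns budget via its Algorithm 1); only the role of `p` as the fraction of initial budget retained by unimportant layers follows the description above.

```python
def reallocate_budgets(importance, per_layer_budget, p=0.35):
    """Hypothetical sketch of layer-wise KV-cache budget reallocation.

    importance:       per-layer importance scores (higher = more important).
    per_layer_budget: uniform cache budget each layer starts with.
    p:                fraction of the initial budget kept by unimportant
                      layers (the paper reports 0.3-0.4 as a good range).
    """
    n = len(importance)
    # Split layers into an "unimportant" lower half by score.
    # (A stand-in for the paper's importance-based grouping.)
    order = sorted(range(n), key=lambda i: importance[i])
    unimportant = set(order[: n // 2])

    budgets = [0] * n
    freed = 0
    for i in unimportant:
        budgets[i] = int(per_layer_budget * p)   # shrink unimportant layers
        freed += per_layer_budget - budgets[i]   # pool the freed budget

    # Redistribute the pooled budget evenly among the remaining layers.
    extra = freed // (n - len(unimportant))
    for i in range(n):
        if i not in unimportant:
            budgets[i] = per_layer_budget + extra
    return budgets
```

With 4 layers, a uniform budget of 100 tokens, and p=0.3, the two lowest-scoring layers drop to 30 tokens each while the two highest-scoring layers grow to 170, keeping the total budget unchanged; smaller p frees more budget for the important layers, matching the paper's observation.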