Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Streaming Attention Approximation via Discrepancy Theory
Authors: Ekaterina Kochetkova, Kshiteej Jitesh Sheth, Insu Han, Amir Zandieh, Michael Kapralov
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Section 4 we empirically evaluate our algorithm in various settings. In Section 4.1 we show our approach leads to a lower relative error for single layer attention approximation for open-source LLMs including Llama-3.1-8B-Instruct [27] and Ministral-8B-Instruct- 2410 [52] as compared to uniformly sampling keys and values in the cache. Section 4.1 we also perform ablation studies to show how various parameters in our algorithm affect the relative error for single layer attention approximation. In Sections 4.2 and 4.3 we perform end to end experiments on various benchmarks such as Long Bench [5] using models of various sizes such as Llama-3.1-8B-Instruct,Qwen-2.5-14B-Instruct and Qwen-2.5-32B-Instruct [77, 71], and Needle in a Haystack [39]. We show that our provable method for attention approximation when applied to key value cache compression performs better compared to previous existing token subset selection heuristics on end to end tasks. Finally in Section 4.4 we present system efficiency metrics regarding our implementation. |
| Researcher Affiliation | Collaboration | Ekaterina Kochetkova EPFL EMAIL Kshiteej Sheth EPFL EMAIL Insu Han KAIST EMAIL Amir Zandieh Google Research EMAIL Michael Kapralov EPFL EMAIL |
| Pseudocode | Yes | Algorithm 1 BALANCEKV((qj, kj, vj)n j=1, r, t, T, ε) Algorithm 2 SOFTMAXBALANCE((kj, vj)j, rkey, rvalue, δ) Algorithm 3 Self-Balancing Walk ((uj)j, r, δ) Algorithm 4 MERGEANDREDUCE((kj, vj)j, t, T, ε) |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code for all experiments is provided in the supplementary material. |
| Open Datasets | Yes | We evaluate the effectiveness of BALANCEKV for approximating attention in individual layers of Llama-3.1-8B-Instruct [27] and Ministral-8B-Instruct-2410 [52] on the Trivia QA dataset from Long Bench [5]. Next we evaluate BALANCEKV on Long Bench dataset [5], which tests long-context understanding across tasks like QA, summarization, few-shot learning, synthetic reasoning, and code completion. 2. We perform the evaluation of BALANCEKV and the uniform sampling when applied to the Intern VL2.5-8B multimodal LLM for compression rates 1/4 and 1/16, for evaluation on the MS COCO image captioning dataset. |
| Dataset Splits | Yes | Next we evaluate BALANCEKV on Long Bench dataset [5], which tests long-context understanding across tasks like QA, summarization, few-shot learning, synthetic reasoning, and code completion. Specifically, we test a version of uniform length distribution (Long Bench-E). We follow the same evaluation metrics from [5]. We evaluate BALANCEKV on the Needle-In-A-Haystack benchmark [39], comparing it against Snap KV, Pyramid KV, Streaming LLM, and uniform sampling using Llama-3.1-8B-Instruct. The test challenges the model to retrieve a specific sentence (the needle ) embedded at an arbitrary position within a long context (the haystack ). |
| Hardware Specification | Yes | Experiments in Section 4.1 and Section 4.3 are performed on a single NVIDIA A100 GPU with 80GB VRAM, and the rest on a single NVIDIA RTX A6000 GPU with 48GB VRAM. |
| Software Dependencies | No | The paper mentions "MInference [37]" for comparing against other methods and "Pytorch: An imperative style, high-performance deep learning library" [54] in the references, but does not provide specific version numbers for these or other software components used in their own implementation. |
| Experiment Setup | Yes | We vary the compression rate 2 T {1/2, 1/4, 1/8, 1/16}. For a fixed dataset and layer, we also analyzed how the performance and runtime of BALANCEKV depend on the batch size and compression rate. More precisely, we repeat the single-layer attention approximation experiment Trivia QA and layers 1 and 15 of Llama-3.1-8B-Instruct, for batch size [64, 128, 256] and compression rate 2 T [1/2, 1/4, 1/8]. During inference, we compress the key-value cache in the prefill stage using a uniform compression rate of (approximately) 0.25 across all methods, while retaining all streamed embeddings during the decoding phase. We set b = 256 and T = 2, achieving a consistent compression rate of 0.25 across all inputs. |