Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
Authors: Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, Lili Qiu
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation shows RetrievalAttention achieves near full attention accuracy while accessing only 1-3% of the data, significantly reducing inference costs. Remarkably, RetrievalAttention enables LLMs with 8B parameters to handle 128K tokens on a single NVIDIA RTX4090 (24GB), achieving a decoding speed of 0.107 seconds per token. We thoroughly evaluate the accuracy and efficiency of RetrievalAttention across three long-context LLMs, using well-known long-context benchmarks like ∞-Bench [19] and RULER [20]. |
| Researcher Affiliation | Collaboration | 1Microsoft Research 2Shanghai Jiao Tong University 3Fudan University |
| Pseudocode | Yes | Algorithm 1 summarizes the design of RetrievalAttention and elaborates the procedure in an algorithm. |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide detailed descriptions of our experiments and implementation in Section 4 and Appendix A.1. Additionally, we submit our code in supplementary materials. |
| Open Datasets | Yes | We adopt three representative long-context benchmarks for evaluation. (1) RULER [20]: a comprehensive and widely used long-context benchmark consisting of retrieval, aggregation, QA tasks, and so on. (2) ∞-Bench [19]: this benchmark consists of retrieval tasks and realistic tasks. The average context length of ∞-Bench is over 100K tokens. (3) Needle-in-a-haystack [35]: it challenges the models to accurately retrieve information (the "needle") hidden within a lengthy document (the "haystack"). |
| Dataset Splits | No | The paper mentions several benchmarks and evaluates performance on them, but it does not explicitly describe the training/test/validation splits used for its experiments within the text. |
| Hardware Specification | Yes | We conduct experiments on a server equipped with one NVIDIA RTX4090 GPU (24GB memory) and an Intel i9-10900X CPU with 10 physical cores and 128GB DRAM. The experiment results using NVIDIA A100 GPU can be found in Appendix A.5. We test the generality of RetrievalAttention by measuring its performance on a server with one A100 GPU (40GB) and one AMD EPYC 7V13 CPU with 12 physical cores and 220GB DRAM. To make sure there is enough CPU memory to hold the KV cache and indexes, especially in the 1M context scenario, we use a powerful machine equipped with an AMD EPYC 7V12 CPU with 48 cores and 1.72 TB of memory. The machine is also equipped with the same 40GB A100 GPU. |
| Software Dependencies | No | The paper mentions using specific LLMs (e.g., Llama-3-8B-Instruct-262k) and a library (Faiss), but it does not provide specific version numbers for any software libraries, programming languages, or tools used in the experiments. |
| Experiment Setup | Yes | All indexing-based methods, including RetrievalAttention, retrieve the top-100 nearest key vectors and use 640 tokens (128 initial tokens + 512 local window tokens) on GPU as the static pattern. All evaluated methods are applied across all model layers for sparse attention during the decoding phase. Detailed configurations of baselines can be found in Appendix A.1. For example, StreamingLLM: it uses 512 initial tokens and 1536 recent tokens for attention computation. SnapKV: it employs an observation window of 32 tokens during the prefill phase and retains the top-2048 tokens. Table 6: Performance (%) of RetrievalAttention with different static pattern sizes on RULER (size: accuracy): 16+64: 73.93; 32+128: 74.15; 64+256: 74.52; 128+512: 74.70. |
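
The Experiment Setup row describes a decoding-time attention pattern made of a static part (128 initial tokens plus a 512-token local window) and a dynamic part (the top-100 retrieved key vectors). The sketch below illustrates how such a set of attended positions could be assembled. It is not the authors' implementation: the function name `select_attended_positions`, the inner-product scoring, and the exact brute-force search are assumptions made for illustration, whereas the paper relies on a vector retrieval index.

```python
import heapq
import random


def select_attended_positions(query, keys, n_initial=128, n_local=512, top_k=100):
    """Return the sorted token positions attended to in one decoding step.

    Static pattern: the first n_initial and the last n_local positions.
    Dynamic part: the top_k remaining positions ranked by inner product
    with the query (assumed scoring; exact search stands in for an index).
    """
    n = len(keys)
    static = set(range(min(n_initial, n))) | set(range(max(0, n - n_local), n))
    # Only positions outside the static pattern are candidates for retrieval.
    candidates = [i for i in range(n) if i not in static]

    def score(i):
        return sum(q * k for q, k in zip(query, keys[i]))

    retrieved = heapq.nlargest(top_k, candidates, key=score)
    return sorted(static | set(retrieved))


# Toy example with random vectors (hypothetical sizes, not from the paper).
random.seed(0)
dim, n_tokens = 8, 2000
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
query = [random.gauss(0, 1) for _ in range(dim)]
positions = select_attended_positions(query, keys)
print(len(positions))  # 740 = 128 initial + 512 local + 100 retrieved
```

With the defaults taken from the table, each decoding step attends to at most 740 positions regardless of context length, which is why the static pattern sizes in Table 6 can be varied independently of the retrieval budget.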