Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Efficient Low Rank Attention for Long-Context Inference in Large Language Models
Authors: Li Tenghui, Guoxu Zhou, Xuyang Zhao, Yuning Qiu, Qibin Zhao
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the RULER and LongBench benchmarks with LLaMA-3-8B and Qwen2.5-7B demonstrate that LRQK matches or surpasses leading sparse-attention methods in long context settings, while delivering significant memory savings with minimal accuracy loss. Our code is available at https://github.com/tenghuilee/LRQK. In this section, a series of experiments are conducted to rigorously evaluate the effectiveness and performance of our proposed method. |
| Researcher Affiliation | Academia | Tenghui Li Guangdong University of Technology RIKEN AIP EMAIL, Guoxu Zhou Guangdong University of Technology Key Laboratory of Intelligent Detection and the Internet of Things in Manufacturing, Ministry of Education, Guangzhou, CHINA EMAIL, Xuyang Zhao RIKEN iTHEMS RIKEN IMS Chiba University EMAIL, Yuning Qiu RIKEN AIP EMAIL, Qibin Zhao RIKEN AIP EMAIL |
| Pseudocode | Yes | Algorithm 1 Prefill of LRQK, alternating updates for A_Q, A_K, B_Q, B_K |
| Open Source Code | Yes | Our code is available at https://github.com/tenghuilee/LRQK. |
| Open Datasets | Yes | Extensive experiments on the RULER and LongBench benchmarks with LLaMA-3-8B and Qwen2.5-7B demonstrate that LRQK matches or surpasses leading sparse-attention methods in long context settings, while delivering significant memory savings with minimal accuracy loss. The accuracy of the proposed method is evaluated on the RULER dataset [28], using a long-context setting with a sequence length of 128K tokens. Experiments are conducted on the summarization task using the wikitext-2-v1 test set [25]. |
| Dataset Splits | Yes | The accuracy of the proposed method is evaluated on the RULER dataset [28], using a long-context setting with a sequence length of 128K tokens. Extensive experiments on the RULER and LongBench benchmarks with LLaMA-3-8B and Qwen2.5-7B demonstrate that LRQK matches or surpasses leading sparse-attention methods in long context settings. Experiments are conducted on the summarization task using the wikitext-2-v1 test set [25]. |
| Hardware Specification | Yes | The experiments are performed on NVIDIA A100 GPUs. For the RULER 128K experiment, the model runs on a single NVIDIA A100 GPU with 80 GB of memory. For the LongBench experiment, the model is executed on a single A100 GPU with 40 GB of memory. The model is running on a single A100 GPU with 40G memory. Due to the 48 GB memory constraints of the NVIDIA A6000 GPU, Mistral is evaluated on RULER 32K and Phi-3-mini is evaluated on RULER 16K. Experiments are conducted on NVIDIA GeForce RTX 3090 (24 GB) GPUs. |
| Software Dependencies | Yes | All LLM evaluations are empowered by OpenCompass [27]. Runtime performances of LRQK are evaluated on meta-llama/Llama-3.1-8B-Instruct-1M, comparing it against standard GPU-only and CPU offloading approaches. Experiments are conducted using the Hugging Face transformers library (footnote 2) on a single NVIDIA A100 GPU (40 GB memory) with batch size 1. (Footnote 2: https://huggingface.co/docs/transformers/v4.52.2/en/main_classes/text_generation) |
| Experiment Setup | Yes | The number of max iteration and the tolerance introduced in Algorithm 1 and 2 are chosen as 2 and 0.01, respectively. All scaling parameters are set as λp Q = λp K = λd1 = λd2 = 1. For the low rank approximation, the rank is set to r = 32. The number of top-k tokens selected based on attention scores is set to 2048 (1.56% of 128K), while the number of lite tokens (the most recently generated tokens) is set to 64. The configuration for LRQK are: rank=16, top-256 active tokens, and 16 lite tokens. Default Configuration. It is recommended starting with the following default configuration, which achieves strong performance across diverse settings: rank r = 32, active tokens top-k = 2048, lite tokens 64, iterations 2, and tolerance 10^-2. |
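
The settings quoted in the Experiment Setup row can be collected into one small record, which also makes the "1.56% of 128K" figure easy to check. The sketch below is illustrative only: the class name `LRQKConfig`, its field names, and the `active_fraction` helper are hypothetical and are not taken from the authors' repository; only the numeric values come from the row above, and 128K is assumed to mean 128 × 1024 tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LRQKConfig:
    """Hypothetical container for the hyperparameters quoted in the table."""

    rank: int = 32          # low-rank approximation rank r
    top_k: int = 2048       # active tokens selected by attention score
    lite_tokens: int = 64   # most recently generated tokens
    max_iter: int = 2       # maximum iterations in Algorithms 1 and 2
    tol: float = 1e-2       # convergence tolerance

    def active_fraction(self, context_len: int) -> float:
        """Share of the context covered by the top-k active tokens."""
        return self.top_k / context_len


# Default configuration recommended in the quoted text.
default = LRQKConfig()

# Smaller configuration also quoted in the row: rank=16, top-256, 16 lite tokens.
small = LRQKConfig(rank=16, top_k=256, lite_tokens=16)

# Assumes 128K = 128 * 1024 = 131072 tokens.
fraction = default.active_fraction(128 * 1024)
print(f"top-k share of a 128K context: {fraction:.4%}")
```

With the default values, 2048 / 131072 = 0.015625, which agrees with the 1.56% stated in the row.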