Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval
Authors: Wenhao Li, Yuxin Zhang, Gen Luo, Haiyuan Wan, Ziyang Gong, Fei Chao, Rongrong Ji
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our contributions are threefold: We propose Spotlight Attention for accelerating LLM inference, which employs non-linear hashing function to encode and match queries and key values within LLMs, thereby efficiently selecting critical KV cache for model inference. We develop a lightweight and robust training framework based on the Bradley-Terry ranking objective, which effectively optimizes the non-linear hashing function using only a small amount of calibration data. Extensive experiments demonstrate that Spotlight Attention can drastically reduce LLM inference latency while maintaining the strongest performance retention in comparison with state-of-the-art methods. |
| Researcher Affiliation | Collaboration | Wenhao Li (Xiamen University); Yuxin Zhang (Xiamen University); Gen Luo (Shanghai AI Laboratory); Haiyuan Wan (Shanghai AI Laboratory); Ziyang Gong (Shanghai Jiao Tong University); Fei Chao (Xiamen University); Rongrong Ji (Xiamen University) |
| Pseudocode | Yes | We provide the ranking loss calculation in the following pseudo-code. To support longer sequence lengths during training, we employ three optimization techniques: (1) Random query selection, where only queries specified by query_index are optimized, rather than all queries. (2) Random top-k selection, where max_top is randomly sampled from the top-k set for optimization. (3) Random non-top-k selection, where max_oth is randomly sampled from the non-top-k set for optimization. These techniques enhance training efficiency in long-context scenarios. def ranking_loss(...) |
| Open Source Code | Yes | All the training and evaluation stuff can be found at https://github.com/wenhaoli-xmu/spotlight. |
| Open Datasets | Yes | We evaluated three language modeling benchmarks PG19 [19], Proof Pile [5], and CodeParrot [18] with 100, 79, and 100 samples, respectively, using perplexity to detect minor errors from sparsification. ... We evaluated the performance of various compression methods on LongBench [6] subtasks. ... Training data consists of 8,192 samples, evenly drawn from the Book and Arxiv datasets [22]. ... We use the offline evaluation version of NIAH [1]... To validate our method's applicability across diverse tasks, we trained models on the GitHub Code [18] and C4 [18] datasets... |
| Dataset Splits | Yes | Training data consists of 8,192 samples, evenly drawn from the Book and Arxiv datasets [22]. ... We evaluated three language modeling benchmarks PG19 [19], Proof Pile [5], and CodeParrot [18] with 100, 79, and 100 samples, respectively... For LLaMA2 models, we evaluated the first 4K tokens per sample; for LLaMA3 and Qwen2.5 models, we used the first 8K tokens. |
| Hardware Specification | Yes | For example, our method achieves up to 3× increase in Qwen2.5-7B [24] inference throughput for both 32K and 128K sequences, with only 2% performance degradation on the LLaMA3 [13] series and no loss on Qwen2.5 [24] series. ... All efficiency experiments were performed on Qwen2.5-7B [24] using eight A100 GPUs. ... achieving hashing retrieval for 512K tokens in under 100µs on a single A100 GPU |
| Software Dependencies | No | For enhanced flexibility, experiments utilized the Hugging Face Transformers framework, optimized with pipeline parallelization and KV cache pre-allocation to boost throughput. ... For top-k gathering and sparse attention, we employed Torch and Flash Attention implementations, respectively. ... Bit-packing (Figure 9) is crucial because PyTorch lacks a native bit type... |
| Experiment Setup | Yes | Our MLP hashing function employs 128-dimensional input, intermediate, and output layers, with a distinct MLP for each head in every layer, producing a 128-bit hash code, much shorter than MagicPIG's minimum of 720 bits. Only the hashing functions are trainable. Training data consists of 8,192 samples, evenly drawn from the Book and Arxiv datasets [22]. To improve efficiency, hidden states for all layers are precomputed and stored, enabling independent layer-wise training without joint fine-tuning. Training uses γ = 64, a learning rate of 1 × 10⁻³, β = 1, and α = 3, for one epoch. Additional details are provided in Appendix A.1. The pruning rate remains fixed at 98% during training, irrespective of evaluation settings. ... Table 7: Detailed training configuration. General: Precision = bf16; Num Iters = 8,192; Batch Size = 1. Learning Rate: Max LR = 0.001; Min LR = 0; Warm Up Iters = 81; Warm Up Method = linear; Annealing = cosine. Gradient: Accumulation = 1; Clipping = 1.0. Optimizer: Optimizer = adamw; β1 = 0.9; β2 = 0.98; Weight Decay = 0.1. Data: Corpus = arxiv, book; Arxiv Samples = 4,096; Book Samples = 4,096; LLaMA2 Trunc = 4,096; LLaMA3/Qwen Trunc = 8,192; Trunc Side = right. |
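
The learning-rate settings quoted in the Experiment Setup row (linear warm-up over 81 iterations, cosine annealing, maximum 0.001, minimum 0, 8,192 iterations) can be illustrated with a minimal sketch. The exact formula is an assumption: the excerpt gives only the schedule's parameters, not how the authors' training code combines them, and the function name `learning_rate` is hypothetical.

```python
import math

# Values taken from Table 7 as quoted in the Experiment Setup row.
NUM_ITERS = 8192
WARMUP_ITERS = 81
MAX_LR = 1e-3
MIN_LR = 0.0


def learning_rate(step: int) -> float:
    """Return the learning rate for a 0-indexed optimizer step.

    Assumed shape (not confirmed by the source): a linear warm-up that
    reaches MAX_LR after WARMUP_ITERS steps, followed by cosine annealing
    from MAX_LR down to MIN_LR at the final step.
    """
    if step < WARMUP_ITERS:
        # Linear ramp: step 0 gives MAX_LR / WARMUP_ITERS, the last
        # warm-up step gives MAX_LR.
        return MAX_LR * (step + 1) / WARMUP_ITERS
    # Fraction of the annealing phase completed, clamped to [0, 1].
    span = max(1, NUM_ITERS - 1 - WARMUP_ITERS)
    progress = min(1.0, (step - WARMUP_ITERS) / span)
    # Half-cosine decay from MAX_LR to MIN_LR.
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions, the rate peaks at 0.001 when warm-up ends and decays to 0 by the last of the 8,192 iterations. A function of this form could be passed to a scheduler such as PyTorch's `torch.optim.lr_scheduler.LambdaLR` after dividing by the optimizer's base learning rate.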