Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SALS: Sparse Attention in Latent Space for KV Cache Compression

Authors: Junlin Mu, Hantao Huang, Jihang Zhang, Minghui Yu, Tao Wang, Yidong Li

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We comprehensively evaluate SALS on various tasks using two large-scale models: LLaMA2-7b-chat and Mistral-7b, and additionally verify its scalability on the RULER-128k benchmark with LLaMA3.1-8B-Instruct. Experimental results demonstrate that SALS achieves SOTA performance by maintaining competitive accuracy.
Researcher Affiliation | Collaboration | (1) Key Laboratory of Big Data & Artificial Intelligence in Transportation, Ministry of Education, China; (2) School of Computer Science & Technology, Beijing Jiaotong University, China; (3) ByteDance Seed, China
Pseudocode | Yes | Algorithm 1 Sparse Attention in Latent Space
Open Source Code | No | The source code will be publicly available in the future.
Open Datasets | Yes | We evaluate our SALS framework on two mainstream 7B chat models: LLaMA2 7B Chat [23], that employs multi head attention (MHA), and Mistral 7B v0.2 [11], that uses grouped query attention (GQA). To assess accuracy, we report results on the reasoning benchmark GSM8K [6], the conversational benchmark CoQA [16], and the 16 English subsets of LongBench [2] that probe long context understanding. ... To obtain the latent projection matrix, we randomly sample 512 sequences of length 4096 from the C4 corpus [9] and compute it offline. ... RULER [10] is a recently proposed long-context benchmark...
Dataset Splits | Yes | For the dataset GSM8K and CoQA, we always keep the most recent w = 128 tokens and decode the remaining context at a fixed sparsity of 1/4. For the dataset LongBench, LLaMA2 7B Chat supports a 4k context window, whereas Mistral 7B v0.2 supports 32k. To equalise the average sparsity at 1/8 on this benchmark, we retain Nc = 512 tokens for LLaMA2 7B Chat and Nc = 1024 tokens for Mistral 7B v0.2. For LLaMA2 7B Chat we follow the Hshare [25] configuration (x = 16 sink tokens, y = 432 critical tokens, z = 64 recent tokens) and simply double each count for Mistral 7B v0.2.
Hardware Specification | Yes | All experiments are conducted on a machine with Xeon(R) Platinum 8336C CPU, one GPU (ampere architecture), and 128G RAM.
Software Dependencies | No | Following Hshare [25], all speed tests are performed on the PyTorch backend, with the Triton based fused kernel as discussed in Sec 4.5. (No version numbers provided for PyTorch or Triton.)
Experiment Setup | Yes | To obtain the latent projection matrix, we randomly sample 512 sequences of length 4096 from the C4 corpus [9] and compute it offline. We apply multi-head joint compression to the key cache at two ratios: dr = 25% and dr = 12.5%. For latent scoring, we simply set r = 0.5r for all models. Since value tensors are almost full rank and play a pivotal role in attention, we forgo low-rank projection for them and instead perform channel-wise group quantisation that mirrors the key-cache setting (4 bit at 25% and 2 bit at 12.5%). Following KIVI, we adopt a mixed high precision / low precision scheme: tokens in the most recent window are compressed by only 50%, whereas all preceding tokens are compressed according to the target ratio of the experiment. The high precision window is aligned with the sparsity window: when the sparse mechanism always selects the most recent w tokens, the compression stage likewise keeps those same w tokens in high precision. Based on the observations in Figure 2, we skip the sparsification for layers 0, 1, and 31 across all models to ensure more accurate compression and sparsity results.
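
The numbers quoted in the Dataset Splits and Experiment Setup rows can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' code (the report notes that no source is released). The names `TokenBudget`, `VALUE_BITS`, `SKIPPED_LAYERS`, and `sparsified_layers` are hypothetical. The 16-bit baseline used to relate bit widths to compression ratios and the 32-layer depth are assumptions that are not stated in the quoted text, although layer index 31 suggests the latter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBudget:
    """Retained-token budget per decoding step (hypothetical helper)."""
    sink: int      # x: sink tokens at the start of the sequence
    critical: int  # y: tokens chosen by the sparse selection
    recent: int    # z: most recent tokens

    @property
    def total(self) -> int:
        return self.sink + self.critical + self.recent

    def scaled(self, factor: int) -> "TokenBudget":
        # "simply double each count" corresponds to factor == 2
        return TokenBudget(self.sink * factor,
                           self.critical * factor,
                           self.recent * factor)


# Counts quoted for LLaMA2 7B Chat, doubled for Mistral 7B v0.2.
LLAMA2_BUDGET = TokenBudget(sink=16, critical=432, recent=64)
MISTRAL_BUDGET = LLAMA2_BUDGET.scaled(2)

# Key-cache compression ratio -> value-cache bit width, as quoted.
VALUE_BITS = {0.25: 4, 0.125: 2}
BASELINE_BITS = 16  # assumption: uncompressed values are stored in 16 bits

# Layers that skip sparsification, as quoted.
SKIPPED_LAYERS = (0, 1, 31)


def sparsified_layers(num_layers: int = 32) -> list[int]:
    """Layer indices that keep sparsification (num_layers=32 is an assumption)."""
    return [i for i in range(num_layers) if i not in SKIPPED_LAYERS]


if __name__ == "__main__":
    print(LLAMA2_BUDGET.total)       # 16 + 432 + 64 = 512, matching Nc = 512
    print(MISTRAL_BUDGET.total)      # 1024, matching Nc = 1024
    for ratio, bits in VALUE_BITS.items():
        # Under the 16-bit assumption the bit width reproduces the ratio.
        print(ratio, bits / BASELINE_BITS)
    print(len(sparsified_layers()))  # 32 - 3 = 29 layers
```

Under these assumptions the x, y, z counts sum exactly to the quoted Nc for both models, and the 4-bit and 2-bit value settings correspond to the 25% and 12.5% key-cache ratios.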