Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

SALS: Sparse Attention in Latent Space for KV Cache Compression

Authors: Junlin Mu, Hantao Huang, Jihang Zhang, Minghui Yu, Tao Wang, Yidong Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We comprehensively evaluate SALS on various tasks using two large-scale models: LLaMA2-7b-chat and Mistral-7b, and additionally verify its scalability on the RULER-128k benchmark with LLaMA3.1-8B-Instruct. Experimental results demonstrate that SALS achieves SOTA performance by maintaining competitive accuracy.
Researcher Affiliation	Collaboration	1 Key Laboratory of Big Data & Artificial Intelligence in Transportation, Ministry of Education, China; 2 School of Computer Science & Technology, Beijing Jiaotong University, China; 3 ByteDance Seed, China
Pseudocode	Yes	Algorithm 1 Sparse Attention in Latent Space
Open Source Code	No	The source code will be publicly available in the future.
Open Datasets	Yes	We evaluate our SALS framework on two mainstream 7B chat models: LLaMA2 7B Chat [23], that employs multi head attention (MHA), and Mistral 7B v0.2 [11], that uses grouped query attention (GQA). To assess accuracy, we report results on the reasoning benchmark GSM8K [6], the conversational benchmark CoQA [16], and the 16 English subsets of LongBench [2] that probe long context understanding. ... To obtain the latent projection matrix, we randomly sample 512 sequences of length 4096 from the C4 corpus [9] and compute it offline. ... RULER [10] is a recently proposed long-context benchmark...
Dataset Splits	Yes	For the dataset GSM8K and CoQA, we always keep the most recent w = 128 tokens and decode the remaining context at a fixed sparsity of 1/4. For the dataset LongBench, LLaMA2 7B Chat supports a 4 k context window, whereas Mistral 7B v0.2 supports 32 k. To equalise the average sparsity at 1/8 on this benchmark, we retain Nc = 512 tokens for LLaMA2 7B Chat and Nc = 1024 tokens for Mistral 7B v0.2. For LLaMA2 7B Chat we follow the Hshare [25] configuration (x = 16 sink tokens, y = 432 critical tokens, z = 64 recent tokens) and simply double each count for Mistral 7B v0.2.
Hardware Specification	Yes	All experiments are conducted on a machine with Xeon(R) Platinum 8336C CPU, one GPU (ampere architecture), and 128G RAM.
Software Dependencies	No	Following Hshare [25], all speed tests are performed on the PyTorch backend, with the Triton based fused kernel as discussed in Sec 4.5. (No version numbers provided for PyTorch or Triton.)
Experiment Setup	Yes	To obtain the latent projection matrix, we randomly sample 512 sequences of length 4096 from the C4 corpus [9] and compute it offline. We apply multi-head joint compression to the key cache at two ratios: dr = 25% and dr = 12.5%. For latent scoring, we simply set r = 0.5r for all models. Since value tensors are almost full rank and play a pivotal role in attention, we forgo low-rank projection for them and instead perform channel-wise group quantisation that mirrors the key-cache setting (4 bit at 25% and 2 bit at 12.5%). Following KIVI, we adopt a mixed high precision / low precision scheme: tokens in the most recent window are compressed by only 50%, whereas all preceding tokens are compressed according to the target ratio of the experiment. The high precision window is aligned with the sparsity window: when the sparse mechanism always selects the most recent w tokens, the compression stage likewise keeps those same w tokens in high precision. Based on the observations in Figure 2, we skip the sparsification for layers 0, 1, and 31 across all models to ensure more accurate compression and sparsity results.
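
As a rough check on the numbers quoted in the Dataset Splits and Experiment Setup rows, the sketch below adds up the sink, critical, and recent token counts and lists the layers that remain sparsified. It is an illustration only, not the authors' code: the names `TokenBudget`, `scaled`, and `sparsified_layers` are hypothetical, the description of each token type in the comments is a general reading of the terms rather than something stated in the rows, and the default of 32 transformer layers is an assumption based on the 7B models named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBudget:
    """Hypothetical container for the per-request token counts quoted above."""
    sink: int      # x: initial tokens that are always retained
    critical: int  # y: tokens picked by the sparse selection step
    recent: int    # z: most recent tokens that are always retained

    @property
    def total(self) -> int:
        # Nc in the table: total number of retained tokens
        return self.sink + self.critical + self.recent

    def scaled(self, factor: int) -> "TokenBudget":
        # "simply double each count" corresponds to factor = 2
        return TokenBudget(self.sink * factor,
                           self.critical * factor,
                           self.recent * factor)


# Counts quoted in the Dataset Splits row for LLaMA2 7B Chat
llama2_budget = TokenBudget(sink=16, critical=432, recent=64)
# Mistral 7B v0.2 doubles each count
mistral_budget = llama2_budget.scaled(2)

# Layers quoted as skipped in the Experiment Setup row
SKIPPED_LAYERS = frozenset({0, 1, 31})


def sparsified_layers(num_layers: int = 32) -> list[int]:
    # num_layers = 32 is an assumption, not a value given in the table
    return [i for i in range(num_layers) if i not in SKIPPED_LAYERS]


print(llama2_budget.total)        # 512
print(mistral_budget.total)       # 1024
print(len(sparsified_layers()))   # 29
```

The totals match the Nc = 512 and Nc = 1024 figures in the Dataset Splits row, which suggests the three counts are meant to partition the retained tokens.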