Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
MUSTAFAR: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference
Authors: Donghyeon Joo, Helya Hosseini, Ramyad Hadidi, Bahar Asgari
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that unstructured sparsity significantly improves KV cache compression for LLMs, enabling sparsity levels up to 70% without compromising accuracy or requiring fine-tuning. We conduct a systematic exploration of pruning strategies and find per-token magnitude-based pruning as highly effective for both Key and Value caches under unstructured sparsity, surpassing prior structured pruning schemes. The Key cache benefits from prominent outlier elements, while the Value cache surprisingly benefits from a simple magnitude-based pruning despite its uniform distribution. KV cache size is the major bottleneck in decode performance due to high memory overhead for large context lengths. To address this, we use a bitmap-based sparse format and a custom attention kernel capable of compressing and directly computing over compressed caches pruned to arbitrary sparsity patterns, significantly accelerating memory-bound operations in decode computations and thereby compensating for the overhead of runtime pruning and compression. Our custom attention kernel coupled with the bitmap-based format delivers substantial compression of KV cache up to 45% of dense inference and thereby enables longer context lengths and increased tokens/sec throughput of up to 2.23× compared to dense inference. Our pruning mechanism and sparse attention kernel is available at https://github.com/dhjoo98/mustafar. |
| Researcher Affiliation | Collaboration | Donghyeon Joo¹, Helya Hosseini¹, Ramyad Hadidi², Bahar Asgari¹; ¹Department of Computer Science, University of Maryland; ²d-Matrix; EMAIL, EMAIL |
| Pseudocode | Yes | Figure 5b and Algorithm 1 present the Mustafar sparse attention kernel. |
| Open Source Code | Yes | Our pruning mechanism and sparse attention kernel is available at https://github.com/dhjoo98/mustafar. |
| Open Datasets | Yes | For accuracy evaluation, we use tasks from LongBench [3] to test the accuracy retention of per-token magnitude-based pruning of KV cache. We evaluate on three models: Llama-2-7B [40], Llama-3-8B-Instruct [12], and Mistral-7B-Instruct-v0.2 [19]. ... Additionally, we provide accuracy evaluation on RULER [17] benchmark in Appendix A.3. |
| Dataset Splits | Yes | For Llama-2 7B, we used input sequence length of 2048 and generated 2048 tokens. For Llama-3 8B, we use input sequence length of 4096 and generated 4096 tokens. For dense baseline, Flash Attention [6] was used on prefill and decode phase. Overall, we see that Mustafar is able to achieve higher throughput as well as support larger batch size owing to the reduced memory footprint of KV cache. In Llama-3, we see that enabling batch size of 8 leads to 2.23× tokens/sec throughput compared to the dense inference of batch size 6. Even within the same batch size, we see an increased throughput of up to 1.89×. This is due to the pruning and compression overhead amortized by the speedup of Mustafar sparse attention kernel, leading to faster inference latency. However, in batch size 1, we see that throughput is lower than dense inference. This is due to the underutilization of GPU in Mustafar sparse attention kernel with small batch size, where the number of threadblocks is smaller than the number of SMs. We provide additional throughput comparison with different input:output token ratios in Appendix C.3. |
| Hardware Specification | Yes | Efficiency evaluation is tested on NVIDIA RTX 6000ADA GPU and measured with NVIDIA Nsight Profiling Tool. |
| Software Dependencies | No | The paper mentions software like Flash Attention and cuBLAS but does not provide specific version numbers for these or other key software components required for reproduction. |
| Experiment Setup | Yes | Methodology: We evaluate Mustafar on two aspects: Accuracy and Efficiency. For accuracy evaluation, we use tasks from LongBench [3] to test the accuracy retention of per-token magnitude-based pruning of KV cache. We evaluate on three models: Llama-2-7B [40], Llama-3-8B-Instruct [12], and Mistral-7B-Instruct-v0.2 [19]. We also explore the impact of Mustafar when jointly used with orthogonal compression techniques, KV cache quantization KIVI [30] and token-wise eviction H2O [51]. For efficiency evaluation, we evaluate the impact on KV cache compression and computational latency with Llama-2-7B and Llama-3-8B-Instruct. Efficiency evaluation is tested on NVIDIA RTX 6000ADA GPU and measured with NVIDIA Nsight Profiling Tool. Additionally, we provide accuracy evaluation on RULER [17] benchmark in Appendix A.3, accuracy comparison of Mustafar's unstructured sparsity with 2:4 semi-structured sparsity in Appendix B, and additional kernel throughput evaluation in Appendix C.3. ... Previous works [51, 44] have shown that this is effective in preserving model accuracy, meanwhile being small enough in size to not severely impact the compression. ... For Llama-2 7B, we used input sequence length of 2048 and generated 2048 tokens. For Llama-3 8B, we use input sequence length of 4096 and generated 4096 tokens. |
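
The excerpts in the table describe per-token magnitude-based pruning and a bitmap-based sparse format that the attention kernel computes over directly. The sketch below illustrates those two ideas in plain Python for a single token vector. It is not the authors' CUDA kernel or their actual storage layout: the function names (`prune_token`, `compress`, `sparse_dot`), the single-integer bitmap, and the example vector are all assumptions made for illustration.

```python
from typing import List, Tuple


def prune_token(vec: List[float], sparsity: float) -> List[float]:
    """Zero the smallest-magnitude entries of one token's Key or Value vector.

    `sparsity` is the fraction of entries to drop (0.5 drops half of them).
    Hypothetical helper, not taken from the Mustafar repository.
    """
    n_drop = int(len(vec) * sparsity)
    # Indices in ascending order of magnitude; the first n_drop are pruned.
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]))
    dropped = set(order[:n_drop])
    return [0.0 if i in dropped else v for i, v in enumerate(vec)]


def compress(vec: List[float]) -> Tuple[int, List[float]]:
    """Pack a pruned vector into (bitmap, nonzero values).

    Bit i of the bitmap is set when element i is nonzero, so any
    unstructured sparsity pattern can be represented.
    """
    bitmap = 0
    values: List[float] = []
    for i, v in enumerate(vec):
        if v != 0.0:
            bitmap |= 1 << i
            values.append(v)
    return bitmap, values


def sparse_dot(query: List[float], bitmap: int, values: List[float]) -> float:
    """Dot product of a dense query with a compressed vector, without decompressing it."""
    total = 0.0
    k = 0  # position of the next stored nonzero value
    for i, q in enumerate(query):
        if (bitmap >> i) & 1:
            total += q * values[k]
            k += 1
    return total


# Illustrative data only: one 8-element "key" vector pruned to 50% sparsity.
key = [0.1, -2.0, 0.05, 0.7, -0.3, 1.5, 0.02, -0.4]
pruned = prune_token(key, 0.5)
bitmap, values = compress(pruned)
score = sparse_dot([1.0] * 8, bitmap, values)
```

In this toy layout, only the surviving values and one bit per element are stored, which is where the memory saving comes from. A real kernel would tile the cache and pick bitmap widths to suit the GPU, which this sketch does not attempt to model.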