Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

R-KV: Redundancy-aware KV Cache Compression for Reasoning Models

Authors: Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Animashree Anandkumar, Abedelkadir Asi, Junjie Hu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results show that R-KV consistently outperforms existing KV cache compression baselines across two mathematical reasoning datasets.
Researcher Affiliation	Collaboration	(1) University of Wisconsin Madison; (2) Microsoft; (3) Carnegie Mellon University; (4) California Institute of Technology; (5) University of California San Diego; (6) University of Surrey; (7) Adobe; (8) University of California Berkeley
Pseudocode	Yes	A.1 Algorithm The pseudo-code of the method is shown in Algorithm 1.
Open Source Code	Yes	https://github.com/Zefan-Cai/R-KV
Open Datasets	Yes	We evaluate the models mathematical reasoning capabilities using three benchmarks: MATH-500 [8] and AIME 2024 [9].
Dataset Splits	Yes	We evaluate the models mathematical reasoning capabilities using three benchmarks: MATH-500 [8] and AIME 2024 [9]. ... Following existing works [1], we utilize pass@k evaluation [10] and report pass@1 using a non-zero temperature. ... We generate 64 responses for each question.
Hardware Specification	Yes	We use NVIDIA A100 80G to finish all the experiments.
Software Dependencies	No	No specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) are mentioned in the text.
Experiment Setup	Yes	Hyperparameters: We set B_buffer = 128, α = 8 and λ = 0.1, with an analysis of λ in 5.1. ... We set the maximum generation length to 16,384 tokens for MATH-500 and 32,768 tokens for AIME 2024 and AIME 2025 ... We use the recommended sampling temperature and top-p value for each model, i.e., sampling temperature of 0.6 and a top-p value of 0.95 for DeepSeek R1 Distilled models. We generate 64 responses for each question.
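
The Dataset Splits and Experiment Setup rows mention pass@k evaluation, with pass@1 reported from 64 sampled responses per question. As a rough illustration of how such a number can be computed, the sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples and c the number of correct ones. Whether the paper's evaluation code uses exactly this estimator is an assumption, and the function names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one question.

    n: number of sampled responses (64 in the quoted setup)
    c: number of those responses judged correct
    k: the k in pass@k (1 for pass@1)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples are incorrect, every size-k subset
    # contains at least one correct response.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-question estimates over a benchmark."""
    if not correct_counts:
        raise ValueError("correct_counts must not be empty")
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Hypothetical correct counts for three questions, 64 samples each.
    print(mean_pass_at_k([64, 32, 0], n=64, k=1))
```

For k = 1 the estimator reduces to c / n, so pass@1 under this scheme is simply the fraction of correct responses per question, averaged over all questions.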