Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MPCache: MPC-Friendly KV Cache Eviction for Efficient Private LLM Inference

Authors: Wenxuan Zeng, Ye Dong, Jinjin Zhou, Jin Tan, Lei Wang, Tao Wei, Runsheng Wang, Meng Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate that MPCache consistently outperforms prior-art KV cache eviction baselines across different generation tasks and achieves 1.8–2.01× and 3.39–8.37× decoding latency and communication reduction on different sequence lengths, respectively.
Researcher Affiliation	Collaboration	Wenxuan Zeng¹, Ye Dong², Jinjin Zhou³, Jin Tan³, Lei Wang³, Tao Wei³, Runsheng Wang¹, Meng Li¹; ¹Peking University, ²National University of Singapore, ³Ant Group
Pseudocode	Yes	Algorithm 1: Formulation of KV cache eviction; Algorithm 2: Token gathering protocol Π_Gather for retrieving one token; Algorithm 3: MPCache: KV cache eviction combining static and dynamic algorithm
Open Source Code	Yes	The code can be found here.
Open Datasets	Yes	Our experiments are conducted on LLaMA-2, LongChat-7B-V1.5-32K, and LLaMA-3.1-8B-Instruct on LongBench [2], XSUM [57], and Needle-in-a-Haystack [23]. Refer to Appendix F.1 for details.
Dataset Splits	No	The paper uses LongBench [2], XSUM [57], and Needle-in-a-Haystack [23] but does not explicitly describe the training, validation, or test splits for these datasets. Appendix F.1, cited for detailed setups, describes KV cache clustering and static eviction settings but not dataset splits.
Hardware Specification	Yes	The model performance is evaluated with LongBench on an NVIDIA A100 80GB GPU in PyTorch. The latency is evaluated with Secretflow under the LAN setup [63] with 377 MBps bandwidth and 0.3 ms echo latency [63] on Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz.
Software Dependencies	Yes	The latency is evaluated with Secretflow (SPU V0.9.1) [55] and follows the 3PC protocols of PUMA [19].
Experiment Setup	Yes	KV cache clustering configuration. For hierarchy, we in practice choose a two-level hierarchical structure, i.e., n = 2, and when the final dynamic selection ratio α < 0.5, we drop 50% clusters at the 1st hierarchical level. For the XSUM dataset, we use a cluster size of 8 at the 1st hierarchical level and 4 at the 2nd hierarchical level. For the long-context LongBench, we use larger clusters, i.e., 32 at the 1st hierarchical level and 16 at the 2nd hierarchical level. [...] We empirically choose α = 0.6, and leave more discussions to Appendix F.3 and a theoretical analysis to Appendix G.
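
The clustering settings quoted in the Experiment Setup row can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the names `ClusteringConfig`, `XSUM`, `LONGBENCH`, and `first_level_drop_fraction` are hypothetical, and returning 0.0 when the selection ratio is at least 0.5 is an assumption, since the quote only states the α < 0.5 case. The value α = 0.6 follows an elided passage, so the sketch does not assume which quantity it refers to and leaves it out.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ClusteringConfig:
    """Hypothetical container for the clustering settings quoted above."""
    dataset: str
    # Cluster size at each hierarchical level, 1st level first.
    cluster_sizes: Tuple[int, ...]

    @property
    def num_levels(self) -> int:
        # n in the quoted text (n = 2 for both datasets).
        return len(self.cluster_sizes)


# Values as quoted in the Experiment Setup row.
XSUM = ClusteringConfig("XSUM", (8, 4))
LONGBENCH = ClusteringConfig("LongBench", (32, 16))


def first_level_drop_fraction(selection_ratio: float) -> float:
    """Fraction of 1st-level clusters dropped for a given selection ratio.

    The quote states that 50% of clusters are dropped when the final
    dynamic selection ratio is below 0.5. Returning 0.0 otherwise is an
    assumption of this sketch, not something the quote specifies.
    """
    if not 0.0 <= selection_ratio <= 1.0:
        raise ValueError("selection_ratio must lie in [0, 1]")
    return 0.5 if selection_ratio < 0.5 else 0.0
```

For example, `first_level_drop_fraction(0.3)` returns 0.5 under the quoted rule, while `LONGBENCH.cluster_sizes` gives the per-level sizes (32, 16).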