Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MPCache: MPC-Friendly KV Cache Eviction for Efficient Private LLM Inference

Authors: Wenxuan Zeng, Ye Dong, Jinjin Zhou, Jin Tan, Lei Wang, Tao Wei, Runsheng Wang, Meng Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that MPCache consistently outperforms prior-art KV cache eviction baselines across different generation tasks and achieves 1.8~2.01× and 3.39~8.37× decoding latency and communication reduction on different sequence lengths, respectively.
Researcher Affiliation | Collaboration | Wenxuan Zeng¹, Ye Dong², Jinjin Zhou³, Jin Tan³, Lei Wang³, Tao Wei³, Runsheng Wang¹, Meng Li¹; ¹Peking University, ²National University of Singapore, ³Ant Group
Pseudocode | Yes | Algorithm 1: Formulation of KV cache eviction; Algorithm 2: Token gathering protocol ΠGather for retrieving one token; Algorithm 3: MPCache: KV cache eviction combining static and dynamic algorithm
Open Source Code | Yes | The code can be found here.
Open Datasets | Yes | Our experiments are conducted on LLaMA-2, LongChat-7B-V1.5-32K, and LLaMA-3.1-8B-Instruct on LongBench [2], XSUM [57], and Needle-in-a-Haystack [23]. Refer to Appendix F.1 for details.
Dataset Splits | No | The paper uses LongBench [2], XSUM [57], and Needle-in-a-Haystack [23] but does not explicitly describe the training, validation, or test splits for these datasets. Appendix F.1, cited for detailed setups, describes KV cache clustering and static eviction settings but not dataset splits.
Hardware Specification | Yes | The model performance is evaluated with LongBench on an NVIDIA A100 80GB GPU in PyTorch. The latency is evaluated with Secretflow under the LAN setup [63] with 377MBps bandwidth and 0.3ms echo latency [63] on Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz.
Software Dependencies | Yes | The latency is evaluated with Secretflow (SPU V0.9.1) [55] and follows the 3PC protocols of PUMA [19].
Experiment Setup | Yes | KV cache clustering configuration. For hierarchy, we in practice choose a two-level hierarchical structure, i.e., n = 2, and when the final dynamic selection ratio α < 0.5, we drop 50% clusters at the 1st hierarchical level. For the XSUM dataset, we use a cluster size of 8 at the 1st hierarchical level and 4 at the 2nd hierarchical level. For the long-context LongBench, we use larger clusters, i.e., 32 at the 1st hierarchical level and 16 at the 2nd hierarchical level. [...] We empirically choose α = 0.6, and leave more discussions to Appendix F.3 and a theoretical analysis to Appendix G.
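
The clustering settings quoted in the Experiment Setup row can be written down as a small configuration sketch. This is only an illustration of the reported numbers, not the authors' code: the names `ClusterConfig`, `CONFIGS`, and `level1_clusters_kept` are hypothetical, and the rounding rule for "drop 50% clusters" is an assumption, since the row does not say how an odd cluster count is handled or how the dropped clusters are chosen.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class ClusterConfig:
    """Two-level (n = 2) cluster sizes quoted in the Experiment Setup row."""
    level1_size: int  # tokens per cluster at the 1st hierarchical level
    level2_size: int  # tokens per cluster at the 2nd hierarchical level


# Cluster sizes as reported; the dictionary keys are illustrative names.
CONFIGS = {
    "xsum": ClusterConfig(level1_size=8, level2_size=4),
    "longbench": ClusterConfig(level1_size=32, level2_size=16),
}


def level1_clusters_kept(num_tokens: int, cfg: ClusterConfig,
                         selection_ratio: float) -> int:
    """Count the 1st-level clusters that remain after the coarse drop.

    Assumption: when the dynamic selection ratio is below 0.5, half of the
    1st-level clusters are dropped (rounding the kept half up); otherwise
    every cluster is kept. Which clusters are dropped is not modeled here.
    """
    total = ceil(num_tokens / cfg.level1_size)
    if selection_ratio < 0.5:
        return ceil(total / 2)
    return total
```

Under these assumptions, a 1024-token cache with the LongBench settings forms 32 first-level clusters, of which 16 remain when the selection ratio is 0.25 and all 32 remain when it is 0.5 or higher.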