Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse

Authors: Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments across 7 datasets demonstrate that KVLINK improves question answering accuracy by an average of 4% over state-of-the-art methods. Furthermore, by leveraging precomputed KV caches, our approach reduces time-to-first-token by up to 96% compared to standard LLM inference, making it a scalable and efficient solution for context reuse.
Researcher Affiliation	Collaboration	Jingbo Yang Department of Computer Science UC Santa Barbara EMAIL Bairu Hou Department of Computer Science UC Santa Barbara EMAIL Wei Wei Center for Advanced AI Accenture EMAIL Yujia Bao Center for Advanced AI Accenture EMAIL Shiyu Chang Department of Computer Science UC Santa Barbara EMAIL
Pseudocode	No	The paper describes methodologies and mechanisms in prose, supported by figures, but does not present any formal pseudocode or algorithm blocks.
Open Source Code	Yes	Code is available at https://github.com/UCSB-NLP-Chang/KVLink.
Open Datasets	Yes	We evaluate the effectiveness of our method through comprehensive experiments across diverse question-answering and text summarization datasets... For example, KVLINK surpasses the best baseline by 6.6% on Natural Questions [10] and 7.3% on HotpotQA [11]. We construct the training dataset by mixing the training sets of 2WikiMQA [16], TriviaQA [17], pretraining data from FineWeb [18], and TÜLU 3 [19].
Dataset Splits	Yes	We construct the training dataset by mixing the training sets of 2WikiMQA [16], TriviaQA [17], pretraining data from FineWeb [18], and TÜLU 3 [19]. Further details on data preprocessing, dataset mixture, and training configurations are provided in Appendix A.1 and A.2. For Natural Questions, we adopt the evaluation protocol from Liu et al. [21]... For 2WikiMQA, HotpotQA, and MuSiQue, we utilize the originally provided retrieved documents for evaluation. For TriviaQA we retrieve 10 documents... In all cases, documents are encoded separately into KV cache.
Hardware Specification	Yes	fine-tuning them for 6,000 steps using a global batch size of 64 across 8 H100 GPUs.
Software Dependencies	No	The paper mentions using Llama models and external tools like Contriever and GPT-4 for data generation, but it does not provide specific version numbers for software dependencies like programming languages or libraries.
Experiment Setup	Yes	We adopt Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct as the backbone models, fine-tuning them for 6,000 steps using a global batch size of 64 across 8 H100 GPUs. All training examples are truncated to a maximum length of 4096 tokens.
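
The reported training setup implies a few derived quantities that may help when estimating the cost of a reproduction. The sketch below is illustrative only: the class and method names are hypothetical, and the per-GPU figure assumes the global batch is split evenly across the 8 GPUs with no gradient accumulation, which the quoted text does not state. The token count is an upper bound, since examples are truncated to at most 4096 tokens and many may be shorter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Values quoted in the Experiment Setup row (names are hypothetical)."""

    steps: int = 6_000             # fine-tuning steps
    global_batch_size: int = 64    # sequences per optimizer step
    num_gpus: int = 8              # H100 GPUs
    max_seq_len: int = 4096        # truncation length in tokens

    def sequences_per_gpu_per_step(self) -> int:
        # Assumption: even split across GPUs, no gradient accumulation.
        if self.global_batch_size % self.num_gpus != 0:
            raise ValueError("global batch size is not divisible by GPU count")
        return self.global_batch_size // self.num_gpus

    def total_sequences(self) -> int:
        # Number of training sequences consumed over the whole run.
        return self.steps * self.global_batch_size

    def max_training_tokens(self) -> int:
        # Upper bound: every sequence is assumed to reach the truncation length.
        return self.total_sequences() * self.max_seq_len


setup = TrainingSetup()
print("sequences per GPU per step:", setup.sequences_per_gpu_per_step())
print("total sequences:", setup.total_sequences())
print("upper bound on training tokens:", setup.max_training_tokens())
```

Under these assumptions, each GPU processes 8 sequences per step, the run consumes 384,000 sequences, and the token budget is at most about 1.57 billion tokens per backbone model.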