Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Graph-KV: Breaking Sequence via Injecting Structural Biases into Large Language Models

Authors: Haoyu Wang, Peihao Wang, Mufei Li, Shikun Liu, Siqi Miao, Zhangyang "Atlas" Wang, Pan Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate Graph-KV across three scenarios: (1) seven RAG benchmarks spanning direct inference, multi-hop reasoning, and long-document understanding; (2) ARXIV-QA, a novel academic paper QA task with full-text scientific papers structured as citation ego-graphs; and (3) paper topic classification within a citation network. By effectively reducing positional bias and harnessing structural inductive biases, Graph-KV substantially outperforms baselines, including standard costly sequential encoding, across various settings.
Researcher Affiliation	Academia	(1) Georgia Institute of Technology; (2) The University of Texas at Austin
Pseudocode	No	The methodology section '3 Methodology' describes the approach using text and mathematical formulas but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Code and the ARXIV-QA data are publicly available at https://github.com/Graph-COM/GraphKV.
Open Datasets	Yes	Code and the ARXIV-QA data are publicly available at https://github.com/Graph-COM/GraphKV. We evaluated Graph-KV across three diverse settings. First, Graph-KV was assessed on seven RAG benchmarks, covering direct inference [22, 25], multi-hop reasoning [49, 17, 44, 62], and long-document understanding [4]. Second, we introduced ARXIV-QA, a novel and challenging task featuring real-world graph biases. In ARXIV-QA, questions are constructed from the full text of a central scientific paper and its linked references, sourced from the arXiv citation network [20]. Third, Graph-KV was evaluated on paper topic classification tasks within citation networks, which possess inherent structural biases through citation links using the Cora [36] and Pubmed [45] citation graphs.
Dataset Splits	Yes	The dataset is originally from Cora [36] and Pubmed [45]; we adopt the test set split adopted in [7]. For each paper, the input text consists of the title and abstract. For ARXIV-QA, we curated a final dataset comprising 60 primary papers. For all the datasets [RAG], 10 text chunks are provided, and accuracy is selected as the primary metric.
Hardware Specification	Yes	For all the experiments involved in this study, the code is implemented using PyTorch [37], the Hugging Face Transformers library [57], and FlashAttention-2 [9]. As to hardware, for the task ARXIV-QA, the parallel text encoding baselines (Block-RAG, PCW, APE) and Graph-KV run on 4 NVIDIA A100 Tensor Core GPUs, while the sequential encoding baseline runs on 8 NVIDIA A100 Tensor Core GPUs, as it requires higher memory. For the other tasks, all the methods run on NVIDIA RTX 6000 Ada GPUs. We conduct a stress test on synthetic data with an Nvidia RTX6000 GPU (48GB) with an AMD EPYC 7763 64-core processor, to compare Graph-KV with the sequential encoding baseline on scalability and efficiency.
Software Dependencies	No	The paper mentions key software components like 'PyTorch [37]', 'Hugging Face Transformers library [57]', and 'FlashAttention-2 [9]', but does not explicitly state specific version numbers for these software components. For example, it does not say 'PyTorch 1.9' but refers to it by citation.
Experiment Setup	Yes	For all the datasets [RAG], 10 text chunks are provided, and accuracy is selected as the primary metric. We follow [35, 33, 18] to judge whether any correct answers appear in the predicted output. For all methods, the output is constrained to a maximum of 256 tokens. For sequential encoding and Block-RAG [35], we report the average performance with seeds 42 to 44 to randomly shuffle the placement order of sequence. For RAG tasks, the entire prompt input is divided into 3 parts, namely Prefix, Text Chunks, and Question, with each formatted as follows: You are an intelligent AI assistant. Please answer questions based on the user's instructions. Below are some reference documents that may help you in answering the user's question. Text Chunks: -Title: {Title #1}. {Text #1} ... Question: Please write a high-quality answer for the given question using only the provided search documents (some of which might be irrelevant). Question: {Question}.
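
The experiment setup quoted above (a three-part prompt, seeded shuffling of chunk order, and accuracy judged by whether any correct answer appears in the output) can be illustrated with a minimal sketch. This is not the authors' code: the function names `build_prompt`, `is_correct`, and `accuracy` are hypothetical, the newline separators between prompt parts are an assumption, "Text Chunks:" is treated as a part label rather than literal prompt text, and case-insensitive substring matching is an assumed reading of "judge whether any correct answers appear in the predicted output".

```python
import random

# Prefix and question wording are copied from the quoted setup.
PREFIX = (
    "You are an intelligent AI assistant. Please answer questions based on "
    "the user's instructions. Below are some reference documents that may "
    "help you in answering the user's question."
)
QUESTION_TEMPLATE = (
    "Please write a high-quality answer for the given question using only "
    "the provided search documents (some of which might be irrelevant). "
    "Question: {question}."
)


def build_prompt(chunks, question, seed=None):
    """Assemble Prefix + Text Chunks + Question (hypothetical helper).

    chunks: list of (title, text) pairs.
    seed: if given, shuffle the chunk order reproducibly (e.g. 42, 43, 44).
    """
    chunks = list(chunks)  # copy so the caller's list is not reordered
    if seed is not None:
        random.Random(seed).shuffle(chunks)
    # Assumption: parts are joined with newlines; the source does not say.
    body = "\n".join(f"-Title: {title}. {text}" for title, text in chunks)
    return f"{PREFIX}\n{body}\n{QUESTION_TEMPLATE.format(question=question)}"


def is_correct(prediction, answers):
    """True if any gold answer appears in the prediction.

    Assumption: matching is case-insensitive substring containment.
    """
    pred = prediction.lower()
    return any(answer.lower() in pred for answer in answers)


def accuracy(predictions, gold_answers):
    """Fraction of predictions that contain at least one gold answer."""
    hits = [is_correct(p, a) for p, a in zip(predictions, gold_answers)]
    return sum(hits) / len(hits) if hits else 0.0
```

Averaging `accuracy` over prompts built with seeds 42, 43, and 44 would mirror the reported procedure for the order-sensitive baselines, under the assumptions stated above.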