Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Understanding Parametric and Contextual Knowledge Reconciliation within Large Language Models

Authors: Jun Zhao, Yongzhuo Yang, Xiang Hu, Jingqi Tong, Yi Lu, Wei Wu, Tao Gui, Qi Zhang, Xuanjing Huang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We first verify this approach through an attribution experiment, demonstrating that it can accurately detect information about ad-hoc entities from complex hidden states. Next, we trace entity flows across layers to understand how LLMs reconcile conflicting knowledge internally. Our probing results reveal that contextual and parametric knowledge are routed between tokens through distinct sets of attention heads, supporting attention competition only within knowledge types. ... To validate whether our entity-aware probe accurately reflects the model's internal beliefs about entity relevance, we conduct a controlled attribution task... We evaluate attribution performance on three established QA benchmarks (HotpotQA [41], TriviaQA [18], and SQuAD [30]) adapted for both passage-level and sentence-level attribution tasks. We measure attribution accuracy using standard information retrieval metrics: Precision, Recall, and F1. ... Figure 3 presents the experimental results for LLaMA3.1
Researcher Affiliation	Collaboration	1 College of Computer Science and Artificial Intelligence, Fudan University; 2 Shanghai Collaborative Innovation Center of Intelligent Visual Computing; 3 Shanghai Innovation Institute; 4 Ant Group
Pseudocode	No	The paper describes its methodologies and frameworks using textual explanations, mathematical formulations, and figures, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like formatting.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We are eager to release the code and data publicly; however, due to ongoing company approval processes, we are currently unable to share them immediately. We plan to make them available as soon as these procedures are completed.
Open Datasets	Yes	We construct probe training data by adapting the HotpotQA-distractor dataset [41], leveraging its sentence-level human-annotated evidence to precisely identify entities required for answering queries. ... We evaluate attribution performance on three established QA benchmarks (HotpotQA [41], TriviaQA [18], and SQuAD [30]) adapted for both passage-level and sentence-level attribution tasks.
Dataset Splits	No	We construct probe training data by adapting the HotpotQA-distractor dataset [41]... From 6,200 processed HotpotQA samples, each generates training instances $s_i = \langle D_i, q_i, \{e_i^{\langle j \rangle}, y_i^{\langle j \rangle}\}_{j=1}^{N} \rangle$... HotpotQA serves as both the training set of the probe and a sentence-level attribution testing set (see Section 3.3 for training details), while SQuAD and TriviaQA are exclusively used for passage-level attribution evaluation (without dataset-specific probe fine-tuning).
Hardware Specification	Yes	Training the probe requires approximately 3 days on 8 A100 GPUs, and testing on a single dataset takes about 1 hour.
Software Dependencies	No	The experiments were conducted using the OpenRLHF framework with DeepSpeed for distributed training. The model was trained with BF16 mixed precision and FlashAttention optimization to accelerate computation and reduce memory overhead.
Experiment Setup	Yes	Training utilized a global batch size of 256 with a micro-batch size of 4 for memory-efficient gradient accumulation. The AdamW optimizer was configured with a learning rate of 1e-5 and ZeRO Stage 2 optimization. The model was trained for 2 epochs. The input sequence length was capped at 3,000 tokens.
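
The reported batch sizes imply a gradient accumulation count that the quoted text does not state. The sketch below derives it under two assumptions that the paper does not confirm: the micro-batch size of 4 is per GPU, and all 8 A100 GPUs from the Hardware Specification row act as data-parallel workers. The function name `grad_accum_steps` is hypothetical and used only for illustration.

```python
def grad_accum_steps(global_batch: int, micro_batch: int, num_gpus: int) -> int:
    """Return how many micro-batches each GPU accumulates per optimizer step.

    Assumes pure data parallelism, so that
    global_batch = micro_batch * num_gpus * accumulation_steps.
    """
    samples_per_pass = micro_batch * num_gpus  # samples seen in one forward/backward pass
    if global_batch % samples_per_pass != 0:
        raise ValueError("global batch must be a multiple of micro_batch * num_gpus")
    return global_batch // samples_per_pass


# Values quoted in the table; the 8-GPU data-parallel layout is an assumption.
steps = grad_accum_steps(global_batch=256, micro_batch=4, num_gpus=8)
print(steps)  # 8
```

If the run used a different number of data-parallel workers, the accumulation count changes accordingly, which is one reason the exact launch configuration matters for reproduction.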