Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval

Authors: Constantin Venhoff, Ashkan Khakzar, Sonia Joseph, Philip H.S. Torr, Neel Nanda

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	By benchmarking 14 VLMs with various architectures (LLaVA, Native, Cross Attention), sizes (7B-124B parameters), and training setups on factual recall tasks against their original LLM backbone models, we find that 11 of 14 models exhibit factual recall degradation. We select three models exhibiting high and two models with low performance degradation, and use attribution patching, activation patching, and probing to show that degraded VLMs struggle to use the existing factual recall circuit of their LLM backbone, because they resolve the first hop too late in the computation.
Researcher Affiliation	Collaboration	1 University of Oxford, 2 McGill University, 3 Meta, 4 MATS
Pseudocode	No	Section 3.1 "Attribution Patching" and Section 4.1 "Heuristic Patching" provide numbered lists of steps describing a procedure. However, these are descriptive lists of actions rather than structured steps formatted like code with variables, control flow, or programmatic constructs typically found in pseudocode or algorithm blocks.
Open Source Code	Yes	Additionally, the code for all experiments and benchmark generation will be made publicly available to facilitate full reproducibility. The paper provides open access to both the code and the benchmark dataset. These resources are included in the supplementary materials for reviewers and will also be made publicly available upon publication.
Open Datasets	Yes	We sample images from the Wikipedia-based Image Text (WIT) dataset [Srinivasan et al., 2021] and use GPT-4.1 to generate entity-specific factual prompts (e.g. "Who invented the entity shown in the image?"). We use a subset of ImageNet-100 [Deng et al., 2009], restricted to 50 of the 100 classes, as our evaluation dataset.
Dataset Splits	Yes	We use a 20%/80% train-test split to evaluate the probes. Each VLM/backbone pair answers factual recall questions until 1000 valid samples are answered from the 15000-question pool.
Hardware Specification	Yes	We used an NVIDIA A100 GPU with 80GB memory for all local experiments.
Software Dependencies	No	The paper mentions using GPT-4.1 for data generation, but does not provide specific version numbers for any software libraries, frameworks, or programming languages used for implementing their methodology.
Experiment Setup	Yes	To determine which MLP- and Attention-sublayers in the LLM and VLM contribute most to factual recall, we employ an attribution patching methodology inspired by Nanda and Meng et al. First we sample 100 correctly answered examples from the benchmark dataset for each VLM and their original LLM backbone model. Then we use the following steps to compute the attribution scores: ... The noise multiplier determines how strongly we corrupt the entity input embeddings (text tokens or image tokens). ... We provide GPT-4.1 with the entity class, and prompt for a concise factual recall question. ... We use a 20%/80% train-test split to evaluate the probes. ... To test this, we alter the prompting template in the benchmark and include a chain of thought prompt (see Appendix F for prompt).
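
The Experiment Setup excerpt above mentions attribution patching and a noise multiplier for corrupting input embeddings, but the paper's actual steps are elided here. The following is only a generic sketch of the idea on a toy two-layer network, not the authors' procedure: the network, the Gaussian noise model, and the names `forward`, `corrupt`, and `attribution_scores` are all assumptions made for illustration. Each score is a first-order estimate, (clean activation minus corrupted activation) times the gradient of the metric, of how much the metric would change if the clean activation were patched into the corrupted run.

```python
import numpy as np


def forward(x, W1, w2):
    """Toy network (hypothetical): hidden = tanh(W1 @ x), metric = w2 @ hidden."""
    hidden = np.tanh(W1 @ x)
    return hidden, float(w2 @ hidden)


def corrupt(x, noise_multiplier, rng):
    """Assumed corruption: add Gaussian noise scaled by noise_multiplier."""
    return x + noise_multiplier * rng.standard_normal(x.shape)


def attribution_scores(x_clean, x_corrupt, W1, w2):
    """Per-unit first-order estimate of the effect of patching clean hidden
    activations into the corrupted run."""
    h_clean, _ = forward(x_clean, W1, w2)
    h_corrupt, _ = forward(x_corrupt, W1, w2)
    grad = w2  # d(metric)/d(hidden); constant because the metric is linear in hidden
    return (h_clean - h_corrupt) * grad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((4, 3))
    w2 = rng.standard_normal(4)
    x_clean = rng.standard_normal(3)
    x_corrupt = corrupt(x_clean, noise_multiplier=3.0, rng=rng)
    print(attribution_scores(x_clean, x_corrupt, W1, w2))
```

In this toy the metric is linear in the hidden layer, so the scores sum exactly to the clean-minus-corrupted metric difference; in a real transformer the same quantity is only an approximation of full activation patching.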