Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers

Authors: Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I Jordan, Stuart J Russell, Song Mei

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. ... To verify that OCR can induce both generalization and hallucination in LLMs, we conduct experiments on a synthetic dataset on five popular models...
Researcher Affiliation	Academia	Yixiao Huang UC Berkeley EMAIL Hanlin Zhu UC Berkeley EMAIL Tianyu Guo UC Berkeley EMAIL Jiantao Jiao UC Berkeley EMAIL Somayeh Sojoudi UC Berkeley EMAIL Michael I. Jordan UC Berkeley EMAIL Stuart Russell UC Berkeley EMAIL Song Mei UC Berkeley EMAIL
Pseudocode	No	The paper contains mathematical formulations and theoretical analysis but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks or figures.
Open Source Code	Yes	Our code is released at https://github.com/yixiao-huang/OCR-Theory.
Open Datasets	Yes	Following Feng et al. [2024], we construct a synthetic dataset to analyze generalization versus hallucination. ... we extended the LLM experiments in Section 2 to PopQA [Mallen et al., 2022], which is a large-scale open-domain question answering (QA) dataset.
Dataset Splits	Yes	We then create training and test sets for each subset by splitting its subjects with a 0.2 training ratio, resulting in 20% training subjects and 80% test subjects.
Hardware Specification	Yes	The experiments for the one-layer model were run on a single NVIDIA A100 GPU. LLM Experiments were run on a cluster of 4 NVIDIA A100 GPUs and took less than an hour for each run.
Software Dependencies	No	Throughout the paper, we finetune the models using the cross-entropy loss with AdamW optimizer [Kingma, 2014]. This mentions a specific optimizer but no version numbers for it or any other software libraries or frameworks.
Experiment Setup	Yes	For experiments on LLMs, we use full batch and train for 100 epochs. Similar to Feng et al. [2024], we notice that OCR is sensitive to learning rates and thus we sweep across different learning rates in {10⁻⁶, 3×10⁻⁶, 10⁻⁵, 3×10⁻⁵, 10⁻⁴, 3×10⁻⁴} for each model and relation pair and report the results with the lowest test rank. For the one-layer linear attention model, we train the model with one-hot token embedding with d = 128 for 2×10⁴ steps with learning rate 5×10⁻⁴.
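
The Dataset Splits and Experiment Setup rows describe a subject-level split with a 0.2 training ratio and a learning-rate sweep that keeps the run with the lowest test rank. The following is a minimal sketch of that procedure, not the authors' code (which is at the repository linked in the Open Source Code row). The function names, the random shuffle, the seed, and the `evaluate` callback are all assumptions introduced for illustration; the quoted text does not say how subjects are ordered before splitting.

```python
import random

# Learning-rate grid quoted in the Experiment Setup row.
LEARNING_RATES = [1e-6, 3e-6, 1e-5, 3e-5, 1e-4, 3e-4]


def split_subjects(subjects, train_ratio=0.2, seed=0):
    """Split subjects into (train, test) lists.

    Assumption: subjects are shuffled with a fixed seed before splitting;
    the source only states the 20% / 80% ratio.
    """
    shuffled = list(subjects)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(train_ratio * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]


def pick_best_learning_rate(evaluate, learning_rates=LEARNING_RATES):
    """Return (learning_rate, test_rank) for the lowest test rank.

    `evaluate` is a hypothetical callback that fine-tunes a model at the
    given learning rate and returns its test rank (lower is better).
    """
    results = {lr: evaluate(lr) for lr in learning_rates}
    best = min(results, key=results.get)
    return best, results[best]
```

In use, `split_subjects` would be called once per subset of the synthetic dataset, and `pick_best_learning_rate` once per model and relation pair, with `evaluate` wrapping whatever fine-tuning loop is being run.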