Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

How do Transformers Learn Implicit Reasoning?

Authors: Jiaran Ye, Zijun Yao, Zhidian Huang, Liangming Pan, Jinxin Liu, Yushi Bai, Amy Xin, Liu Weichuan, Xiaoyin Che, Lei Hou, Juanzi Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our empirical analysis begins with a behavioral study conducted under fine-grained experimental control (Section 2). Under a complete training configuration, we observe that multi-hop implicit reasoning emerges in three distinct stages: memorization, in-distribution generalization, and finally cross-distribution generalization. Through ablation studies, we further demonstrate that while exposure to in-distribution (ID) triples is not strictly necessary for achieving in-distribution generalization, its absence significantly delays the onset of this behavior.
Researcher Affiliation	Collaboration	DCST, BNRist; KIRC, Institute for Artificial Intelligence, Tsinghua University, China; MOE Key Lab of Computational Linguistics, Peking University, China; Siemens AG, China
Pseudocode	No	The paper describes methods and procedures in paragraph text and figures, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted like code.
Open Source Code	Yes	https://github.com/Jiaran-Ye/ImplicitReasoning
Open Datasets	No	Our data construction pipeline is adapted from the open-source code released by Wang et al. [25], with modifications to support fine-grained query-level control. In all configurations, we construct a symbolic environment consisting of 2000 entities and 200 relations, each assigned a unique token with no inherent semantics.
Dataset Splits	Yes	We construct the training set with a 7.2:1 ratio of Train-II queries to ID atomic triples to ensure compositional supervision dominates, and sample a fixed test set of 3,000 examples for each type. Table 2 summarizes the key dataset statistics. ... Table 2: Dataset Statistics (Data Type / Split / Count). Vocabulary: Entities 2,000; Relations 200. Atomic Triples: In-Distribution (ID) 38,000; Out-of-Distribution (OOD) 2,000. 2-hop Queries: Train-II (ID, ID) 273,600; Test-II (ID, ID) 3,000; Test-IO (ID, OOD) 3,000; Test-OI (OOD, ID) 3,000; Test-OO (OOD, OOD) 3,000.
Hardware Specification	Yes	Training is conducted on NVIDIA RTX 3090 GPUs, and the maximum training duration is extended to 3 weeks to ensure stable cross-distribution generalization.
Software Dependencies	No	All experiments are implemented using the same PyTorch and Hugging Face Transformers framework as in the original codebase.
Experiment Setup	Yes	The model is a decoder-only Transformer, identical in architecture to GPT-2, with 8 layers, 768 hidden dimensions, and 12 attention heads. Optimization is performed using AdamW with a learning rate of 1 × 10⁻⁴, 2000 warm-up steps, weight decay of 0.1, and a batch size of 1024.
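
The figures quoted in the Dataset Splits and Experiment Setup rows can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the authors' repository: the names `ExperimentSetup`, `train_query_ratio`, and `warmup_lr` are hypothetical, the learning rate is read as 1e-4, and the linear warm-up shape is an assumption, since the table gives only the number of warm-up steps.

```python
from dataclasses import dataclass

# Counts quoted in the Dataset Splits row (Table 2 of the paper).
ID_ATOMIC_TRIPLES = 38_000
OOD_ATOMIC_TRIPLES = 2_000
TRAIN_II_QUERIES = 273_600
TEST_SET_SIZE = 3_000
TEST_SPLITS = ("Test-II", "Test-IO", "Test-OI", "Test-OO")


def train_query_ratio(train_queries: int, id_triples: int) -> float:
    """Ratio of Train-II 2-hop queries to ID atomic triples."""
    return train_queries / id_triples


@dataclass(frozen=True)
class ExperimentSetup:
    """Hyperparameters quoted in the Experiment Setup row (hypothetical container)."""
    num_layers: int = 8
    hidden_size: int = 768
    num_heads: int = 12
    learning_rate: float = 1e-4
    warmup_steps: int = 2000
    weight_decay: float = 0.1
    batch_size: int = 1024

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits the hidden size evenly across heads.
        return self.hidden_size // self.num_heads


def warmup_lr(step: int, setup: ExperimentSetup) -> float:
    """ASSUMPTION: linear warm-up followed by a constant rate.

    The table states only the number of warm-up steps, not the schedule shape.
    """
    if step >= setup.warmup_steps:
        return setup.learning_rate
    return setup.learning_rate * step / setup.warmup_steps


if __name__ == "__main__":
    setup = ExperimentSetup()
    ratio = train_query_ratio(TRAIN_II_QUERIES, ID_ATOMIC_TRIPLES)
    print(f"Train-II to ID atomic ratio: {ratio:.1f}:1")
    print(f"Total atomic triples: {ID_ATOMIC_TRIPLES + OOD_ATOMIC_TRIPLES}")
    print(f"Total test queries: {TEST_SET_SIZE * len(TEST_SPLITS)}")
    print(f"Per-head dimension: {setup.head_dim}")
    print(f"Learning rate at step 1000: {warmup_lr(1000, setup):.1e}")
```

The ratio check confirms that 273,600 Train-II queries over 38,000 ID atomic triples matches the stated 7.2:1, and the per-head dimension of 64 follows from 768 hidden dimensions split across 12 heads.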