Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
How do Transformers Learn Implicit Reasoning?
Authors: Jiaran Ye, Zijun Yao, Zhidian Huang, Liangming Pan, Jinxin Liu, Yushi Bai, Amy Xin, Liu Weichuan, Xiaoyin Che, Lei Hou, Juanzi Li
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical analysis begins with a behavioral study conducted under fine-grained experimental control (Section 2). Under a complete training configuration, we observe that multi-hop implicit reasoning emerges in three distinct stages: memorization, in-distribution generalization, and finally cross-distribution generalization. Through ablation studies, we further demonstrate that while exposure to in-distribution (ID) triples is not strictly necessary for achieving in-distribution generalization, its absence significantly delays the onset of this behavior. |
| Researcher Affiliation | Collaboration | DCST, BNRist; KIRC, Institute for Artificial Intelligence, Tsinghua University, China; MOE Key Lab of Computational Linguistics, Peking University, China; Siemens AG, China |
| Pseudocode | No | The paper describes methods and procedures in paragraph text and figures, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted like code. |
| Open Source Code | Yes | https://github.com/Jiaran-Ye/ImplicitReasoning |
| Open Datasets | No | Our data construction pipeline is adapted from the open-source code released by Wang et al. [25], with modifications to support fine-grained query-level control. In all configurations, we construct a symbolic environment consisting of 2000 entities and 200 relations, each assigned a unique token with no inherent semantics. |
| Dataset Splits | Yes | We construct the training set with a 7.2:1 ratio of Train-II queries to ID atomic triples to ensure compositional supervision dominates, and sample a fixed test set of 3,000 examples for each type. Table 2 summarizes the key dataset statistics. ... Table 2: Dataset Statistics. Vocabulary: Entities 2,000; Relations 200. Atomic Triples: In-Distribution (ID) 38,000; Out-of-Distribution (OOD) 2,000. 2-hop Queries: Train-II (ID ID) 273,600; Test-II (ID ID) 3,000; Test-IO (ID OOD) 3,000; Test-OI (OOD ID) 3,000; Test-OO (ID OOD) 3,000. |
| Hardware Specification | Yes | Training is conducted on NVIDIA RTX 3090 GPUs, and the maximum training duration is extended to 3 weeks to ensure stable cross-distribution generalization. |
| Software Dependencies | No | All experiments are implemented using the same PyTorch and Huggingface Transformers framework as in the original codebase. |
| Experiment Setup | Yes | The model is a decoder-only Transformer, identical in architecture to GPT-2, with 8 layers, 768 hidden dimensions, and 12 attention heads. Optimization is performed using AdamW with a learning rate of $1 \times 10^{-4}$, 2000 warm-up steps, weight decay of 0.1, and a batch size of 1024. |
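
The figures quoted in the Dataset Splits and Experiment Setup rows can be collected in one place and checked against each other. The sketch below is illustrative only and is not taken from the authors' repository: the class name, field names, and helper functions are hypothetical, and the warm-up shape (linear ramp, then constant) is an assumption, since the quoted text gives only the number of warm-up steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; field names are hypothetical."""
    # Dataset statistics (Table 2 of the paper, as quoted)
    n_entities: int = 2_000
    n_relations: int = 200
    id_atomic_triples: int = 38_000
    ood_atomic_triples: int = 2_000
    train_ii_queries: int = 273_600
    test_queries_per_type: int = 3_000
    # Model and optimizer settings (Experiment Setup row)
    n_layers: int = 8
    hidden_dim: int = 768
    n_heads: int = 12
    learning_rate: float = 1e-4
    warmup_steps: int = 2_000
    weight_decay: float = 0.1
    batch_size: int = 1024


def query_to_triple_ratio(cfg: ReportedSetup) -> float:
    """Ratio of Train-II queries to ID atomic triples (reported as 7.2:1)."""
    return cfg.train_ii_queries / cfg.id_atomic_triples


def head_dim(cfg: ReportedSetup) -> int:
    """Per-head width implied by the hidden size and head count."""
    if cfg.hidden_dim % cfg.n_heads != 0:
        raise ValueError("hidden_dim must be divisible by n_heads")
    return cfg.hidden_dim // cfg.n_heads


def warmup_lr(step: int, cfg: ReportedSetup) -> float:
    """Learning rate at a given step.

    Assumption: linear warm-up to the peak rate, then constant.
    The source states only that there are 2000 warm-up steps.
    """
    scale = min(1.0, step / cfg.warmup_steps)
    return cfg.learning_rate * scale


if __name__ == "__main__":
    cfg = ReportedSetup()
    print("query:triple ratio =", query_to_triple_ratio(cfg))
    print("head dimension     =", head_dim(cfg))
    print("lr at step 1000    =", warmup_lr(1000, cfg))
```

Under these assumptions the reported numbers are mutually consistent: 273,600 / 38,000 reproduces the stated 7.2:1 ratio, and 768 hidden dimensions split across 12 heads gives a per-head width of 64, matching the GPT-2 convention.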