Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent
Authors: Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical analysis, grounded in the dynamics of gradient descent, shows that trained one-layer transformers can provably solve both tasks with generalization guarantees to unseen trees. In particular, our multi-phase training dynamics for forward reasoning elucidate how different attention heads learn to specialize and coordinate autonomously to solve the two subtasks in a single autoregressive path. These results provide a mechanistic explanation of how trained transformers can implement sequential algorithmic procedures. Moreover, they offer insights into the emergence of reasoning abilities, suggesting that when tasks are structured to take intermediate chain-of-thought steps, even shallow multi-head transformers can effectively solve problems that would otherwise require deeper architectures. A Numerical Experiments In this section, we provide experimental validation of our theoretical results. |
| Researcher Affiliation | Academia | Tong Yang CMU Huang Yu: UPenn Yingbin Liang; OSU Yuejie Chi Yale. Department of Electrical and Computer Engineering, Carnegie Mellon University; email: EMAIL :Department of Statistics and Data Science, Wharton School, University of Pennsylvania; email: EMAIL ;Department of Electrical and Computer Engineering, The Ohio State University; email: EMAIL Department of Statistics and Data Science, Yale University; email: EMAIL |
| Pseudocode | No | The paper describes the proposed methods using mathematical formulations and descriptive text, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We do not provide open access to the data and code since the experiments are simple and are not central to the contribution. |
| Open Datasets | No | We consider a path-finding task in randomly generated trees [4]... For training...we use stochastic gradient descent with a batch size 256 on randomly generated perfect binary trees. The tree generation process is the same as described in Section 4 with m 4, S 31 for backward reasoning, and m 3, S 25 for forward reasoning. |
| Dataset Splits | Yes | The training distribution Ptrain is the uniform distribution over all perfect binary trees T p Vp T q, Ep T q, gp T qq of depth5 m with distinct nodes chosen from r Ss and a goal node gp T q being one of the leaf nodes... At test time, given any tree T (that may not be in the training set)... To validate the generalization performance, we randomly generate 1024 trees with various depths and number of nodes as the test set. |
| Hardware Specification | Yes | The experiments are simple and can be run on a single CPU. |
| Software Dependencies | No | The paper describes the construction and training of a transformer model and uses one-hot embeddings but does not specify any particular software libraries or their version numbers. |
| Experiment Setup | Yes | We construct the transformer model as in (5) and (9) for the backward and forward reasoning task, respectively. We use one-hot embedding... For training... we use stochastic gradient descent with a batch size 256... We use learning rates 1 and 0.2 for backward and forward reasoning, respectively. We initialize all parameters to be zero... |