Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization
Authors: Yu Huang, Zixin Wen, Aarti Singh, Yuejie Chi, Yuxin Chen
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we present a broad set of experiments supporting our theoretical results, confirming the length generalization behaviors and the mechanism of attention concentration. ... A Experiments In this section, we conduct synthetic experiments to verify our theoretical claim. |
| Researcher Affiliation | Academia | Yu Huang Upenn Zixin Wen CMU Aarti Singh CMU Yuejie Chi Yale Yuxin Chen Upenn |
| Pseudocode | Yes | Algorithm 1: Curriculum training for simply transitive actions ... Algorithm 2: Recursive self-training for symmetry actions |
| Open Source Code | No | While the main contribution of this work is theoretical, we provide experimental details in Appendix A to empirically verify and support our theoretical findings. ... We provide details in Appendix A to enable reproduction of our synthetic experiments. |
| Open Datasets | Yes | We study this model on synthetic state-tracking tasks, namely LEGO [45], which distill core LLM skills such as entity tracking, game-state updates, and code evaluation [64, 65]. ... [45] Yi Zhang, Arturs Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, and Tal Wagner. Unveiling transformers with LEGO: a synthetic reasoning task. arXiv preprint arXiv:2206.04301, 2022. |
| Dataset Splits | Yes | For any $L' < L$, we define the truncated distribution $\mathcal{D}^{L,L'}$ of sequences $Z^{L,L'}$ containing all the predicates and the first $L' + 1$ many answer clauses, where $Z^{L,L'}$ is obtained by first sampling $Z^{L} \sim \mathcal{D}^{L}$ and then removing the answer clauses $Z_{\mathrm{ans},\ell}$ for all $\ell > L'$. ... We first train at L = 5 with ground-truth answer supervision until convergence. At the next stage, we double the length to L = 10, use the L = 5 model to greedily generate answer traces (self-labels) for the L = 10 data, and retrain on these pseudo-labels. We repeat this doubling process for three stages (L = 5 → 10 → 20 → 40), so that the final model is trained on self-labeled data at L = 40. |
| Hardware Specification | No | The experiments are small-scale synthetic and can be reproduced on a single GPU; we include the necessary implementation and hyperparameter details, and do not require specialized compute resources. |
| Software Dependencies | No | Training optimizes the next-clause loss in (6a) using Adam [94] with a learning rate of 1e-4. |
| Experiment Setup | Yes | Training optimizes the next-clause loss in (6a) using Adam [94] with a learning rate of 1e-4. We train for 300 epochs to ensure the training loss approaches zero and the model converges. |
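
The length-doubling schedule quoted under Dataset Splits can be sketched as a short loop. This is a minimal illustration and not the authors' code (no code release is reported above): the function names `doubling_schedule` and `recursive_self_training`, and the callables `sample_inputs`, `ground_truth`, `train`, and `generate`, are all hypothetical placeholders. In the paper's setup, `train` would correspond to optimizing the next-clause loss with Adam at a learning rate of 1e-4 for 300 epochs, and `generate` to greedy decoding of answer traces.

```python
from typing import Any, Callable, List, Tuple


def doubling_schedule(start_len: int, num_doublings: int) -> List[int]:
    """Task lengths visited when the length doubles at every stage."""
    return [start_len * 2 ** k for k in range(num_doublings + 1)]


def recursive_self_training(
    model: Any,
    lengths: List[int],
    sample_inputs: Callable[[int], list],      # hypothetical: draws inputs of a given length
    ground_truth: Callable[[Any], Any],        # hypothetical: true answer trace for an input
    train: Callable[[Any, list, list], Any],   # hypothetical: fits the model, returns it
    generate: Callable[[Any, Any], Any],       # hypothetical: greedy decoding by the model
) -> Tuple[Any, List[Tuple[int, str]]]:
    """Train on ground truth at the shortest length, then on self-labels."""
    log: List[Tuple[int, str]] = []

    # Stage 0: supervised training with ground-truth answers.
    inputs = sample_inputs(lengths[0])
    model = train(model, inputs, [ground_truth(x) for x in inputs])
    log.append((lengths[0], "ground-truth"))

    # Later stages: the previous-stage model labels the longer data itself.
    for length in lengths[1:]:
        inputs = sample_inputs(length)
        pseudo_labels = [generate(model, x) for x in inputs]
        model = train(model, inputs, pseudo_labels)
        log.append((length, "self-labels"))

    return model, log


# Toy stand-ins so the sketch runs end to end; they do no real learning.
lengths = doubling_schedule(start_len=5, num_doublings=3)
final_model, log = recursive_self_training(
    model=0,                                            # toy "model": a count of training rounds
    lengths=lengths,
    sample_inputs=lambda n: [list(range(n))],
    ground_truth=lambda x: x,
    train=lambda m, inputs, labels: m + 1,
    generate=lambda m, x: x,
)
print(lengths)
print(log)
```

Only the first stage sees ground-truth answers; every later stage retrains on traces produced by the model from the stage before it, which is the self-labeling step the excerpt describes.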