Reproducibility Index

Notice: The reproducibility variables underlying each score are classified by an automated LLM-based pipeline that was validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization

Authors: Yu Huang, Zixin Wen, Aarti Singh, Yuejie Chi, Yuxin Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Finally, we present a broad set of experiments supporting our theoretical results, confirming the length generalization behaviors and the mechanism of attention concentration. ... A. Experiments. In this section, we conduct synthetic experiments to verify our theoretical claim.
Researcher Affiliation	Academia	Yu Huang (UPenn), Zixin Wen (CMU), Aarti Singh (CMU), Yuejie Chi (Yale), Yuxin Chen (UPenn)
Pseudocode	Yes	Algorithm 1: Curriculum training for simply transitive actions ... Algorithm 2: Recursive self-training for symmetry actions
Open Source Code	No	While the main contribution of this work is theoretical, we provide experimental details in Appendix A to empirically verify and support our theoretical findings. ... We provide details in Appendix A to enable reproduction of our synthetic experiments.
Open Datasets	Yes	We study this model on synthetic state-tracking tasks, namely LEGO [45], which distill core LLM skills such as entity tracking, game-state updates, and code evaluation [64, 65]. ... [45] Yi Zhang, Arturs Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, and Tal Wagner. Unveiling transformers with lego: a synthetic reasoning task. arXiv preprint arXiv:2206.04301, 2022.
Dataset Splits	Yes	For any L' < L, we define the truncated distribution D^{L,L'} of sequences Z^{L,L'} containing all the predicates and the first L' + 1 many answer clauses, where Z^{L,L'} is obtained by first sampling Z^L ~ D^L and then removing the answer clauses Z_{ans,ℓ} for all ℓ > L'. ... We first train at L = 5 with ground-truth answer supervision until convergence. At the next stage, we double the length to L = 10, use the L = 5 model to greedily generate answer traces (self-labels) for the L = 10 data, and retrain on these pseudo-labels. We repeat this doubling process for three stages (L = 5 → 10 → 20 → 40), so that the final model is trained on self-labeled data at L = 40.
Hardware Specification	No	The experiments are small-scale synthetic and can be reproduced on a single GPU; we include the necessary implementation and hyperparameter details, and do not require specialized compute resources.
Software Dependencies	No	Training optimizes the next-clause loss in (6a) using Adam [94] with a learning rate of 1e-4.
Experiment Setup	Yes	Training optimizes the next-clause loss in (6a) using Adam [94] with a learning rate of 1e-4. We train for 300 epochs to ensure the training loss approaches zero and the model converges.
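
The length-doubling self-training schedule and the optimizer settings quoted in the table can be summarized in a short sketch. This is only an illustration of the described procedure, not the authors' code (the paper does not release any): the callables `sample_inputs`, `ground_truth`, `train`, and `generate` are hypothetical placeholders, and reusing the 300-epoch budget at every stage is an assumption, since the quoted text does not say how epochs are split across stages.

```python
from typing import Any, Callable, List, Tuple

# Values quoted in the table above.
LEARNING_RATE = 1e-4  # Adam learning rate
EPOCHS = 300          # assumption: the same epoch budget is reused at every stage
START_LENGTH = 5      # first stage uses ground-truth supervision at L = 5
NUM_DOUBLINGS = 3     # 5 -> 10 -> 20 -> 40


def length_schedule(start: int = START_LENGTH,
                    doublings: int = NUM_DOUBLINGS) -> List[int]:
    """Return the task lengths visited by the doubling curriculum."""
    return [start * 2 ** k for k in range(doublings + 1)]


def recursive_self_training(
    sample_inputs: Callable[[int], Any],   # hypothetical: draw inputs of a given length
    ground_truth: Callable[[Any], Any],    # hypothetical: true answer traces
    train: Callable[..., Any],             # hypothetical: (model, x, y, lr, epochs) -> model
    generate: Callable[[Any, Any], Any],   # hypothetical: greedy decoding with a model
    schedule: List[int],
) -> Tuple[Any, List[Tuple[int, str]]]:
    """Train on ground truth at the first length, then on self-labels afterwards."""
    model = None
    history: List[Tuple[int, str]] = []
    for stage, length in enumerate(schedule):
        inputs = sample_inputs(length)
        if stage == 0:
            labels, source = ground_truth(inputs), "ground-truth"
        else:
            # The model from the previous (shorter) stage labels the longer data.
            labels, source = generate(model, inputs), "self-labels"
        model = train(model, inputs, labels, LEARNING_RATE, EPOCHS)
        history.append((length, source))
    return model, history
```

Calling `length_schedule()` with the defaults yields the four lengths listed in the Dataset Splits row; only the first stage sees ground-truth answers, and every later stage trains on traces generated by the model from the previous stage.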