Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Transformers Learn to Implement Multi-step Gradient Descent with Chain of Thought
Authors: Jianhao Huang, Zixuan Wang, Jason Lee
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate that CoT prompting yields substantial performance improvements. Section 5 empirically validates the advantage of CoT. We empirically validate that the trained transformer converges to the minimizer predicted by our theory, with a distinct performance gap between models trained with and without CoT prompting. Our experiments show that the structure that weights of the full model exhibit is consistent with Theorem 3.2. Our experiments in Figure 2 demonstrate that the evaluation loss of transformers with CoT converges to near zero even when k = 10. We empirically verify the OOD generalization result shown by Theorem 4.2. Our experiment in Figure 3 exhibits that the OOD loss of transformers with CoT converges to near zero when k = 10, 20, 30, 40 as the training loss/in-distribution loss converges to zero. |
| Researcher Affiliation | Academia | ¹Shanghai Jiaotong University, ²Princeton University; huang EMAIL, EMAIL |
| Pseudocode | No | The paper describes the model architecture and processes using mathematical formulas and natural language. There are no explicitly labeled "Pseudocode" or "Algorithm" blocks, nor structured steps formatted like code. |
| Open Source Code | No | The paper does not explicitly state that source code is provided, nor does it include any links to code repositories or mention code in supplementary materials. |
| Open Datasets | No | The paper uses synthetic data generated according to the process described in Equation (1): '$w \sim N(0, I_d)$, $x_i \sim N(0, I_d)$, $y_i = w^\top x_i$ for all $i \in [n]$.' While the generation process is fully described, there is no external link, DOI, or specific repository provided for a pre-existing dataset. |
| Dataset Splits | No | The paper uses synthetic data and describes parameters for its generation (e.g., 'token dimensions d = 10, number of in-context examples n = 20') and training (e.g., 'batch size B = 1000'), but it does not specify explicit training/test/validation dataset splits in the conventional sense, as data appears to be generated per experiment or batch. |
| Hardware Specification | Yes | For all our experiments, we use pytorch Paszke et al. (2019) and models are trained on an NVIDIA RTX A6000. |
| Software Dependencies | No | The paper mentions 'pytorch Paszke et al. (2019)' but does not provide a specific version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | In particular, we choose the token dimensions d = 10, number of in-context examples n = 20, and GD learning rate η = 0.4 for generating the ground-truth intermediate states. We use a batch size B = 1000 and run Adam with learning rate α = 0.001 for τ = 750 iterations. |
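
Since no code or dataset is released, the data process quoted in the Open Datasets row and the GD settings in the Experiment Setup row can be approximated with a short script. The following is a minimal sketch, not the authors' implementation. The function names (`sample_task`, `gd_states`, `mse`), the zero initialization, and the 1/(2n) scaling of the squared loss are assumptions made for illustration, as is reading the label as the inner product of w and x_i. Only d = 10, n = 20, and η = 0.4 come from the table.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sample_task(d, n, rng):
    """Draw w ~ N(0, I_d) and x_i ~ N(0, I_d), then set y_i = <w, x_i>."""
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    ys = [dot(w, x) for x in xs]
    return w, xs, ys


def gd_states(xs, ys, eta, k):
    """Return [w_0, ..., w_k] for GD on L(w) = 1/(2n) * sum_i (<w, x_i> - y_i)^2.

    Assumptions (not stated in the table): w_0 = 0 and the 1/(2n) scaling.
    """
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    states = [w]
    for _ in range(k):
        residuals = [dot(w, x) - y for x, y in zip(xs, ys)]
        grad = [sum(r * x[j] for r, x in zip(residuals, xs)) / n for j in range(d)]
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
        states.append(w)
    return states


def mse(w, xs, ys):
    """Mean squared error of the linear predictor w on the in-context examples."""
    return sum((dot(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen here only for repeatability
    w_true, xs, ys = sample_task(d=10, n=20, rng=rng)
    # Print the loss of each intermediate GD state with the reported eta = 0.4.
    for t, w_t in enumerate(gd_states(xs, ys, eta=0.4, k=10)):
        print(t, mse(w_t, xs, ys))
```

Under these assumptions, the list returned by `gd_states` plays the role of the ground-truth intermediate states mentioned in the Experiment Setup row. The transformer training itself (batch size, Adam, iteration count) is not sketched because the table gives no model architecture.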