Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Transformers Learn to Implement Multi-step Gradient Descent with Chain of Thought

Authors: Jianhao Huang, Zixuan Wang, Jason Lee

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Empirically, we demonstrate that CoT prompting yields substantial performance improvements. Section 5 empirically validates the advantage of CoT. We empirically validate that the trained transformer converges to the minimizer predicted by our theory, with a distinct performance gap between models trained with and without CoT prompting. Our experiments show that the weight structure of the full model is consistent with Theorem 3.2. The experiments in Figure 2 demonstrate that the evaluation loss of transformers with CoT converges to near zero even when k = 10. We also empirically verify the OOD generalization result of Theorem 4.2: the experiment in Figure 3 shows that the OOD loss of transformers with CoT converges to near zero for k = 10, 20, 30, 40 as the training (in-distribution) loss converges to zero.
Researcher Affiliation Academia 1 Shanghai Jiaotong University, 2 Princeton University; huang EMAIL, EMAIL
Pseudocode No The paper describes the model architecture and processes using mathematical formulas and natural language. There are no explicitly labeled "Pseudocode" or "Algorithm" blocks, nor structured steps formatted like code.
Open Source Code No The paper does not explicitly state that source code is provided, nor does it include any links to code repositories or mention code in supplementary materials.
Open Datasets No The paper uses synthetic data generated according to the process described in Equation (1): 'w ~ N(0, I_d), x_i ~ N(0, I_d), y_i = w^T x_i for all i in [n].' While the generation process is fully described, there is no external link, DOI, or specific repository provided for a pre-existing dataset.
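The data-generating process quoted above can be sketched in a few lines. This is a minimal illustration under the stated distributional assumptions (w ~ N(0, I_d), x_i ~ N(0, I_d), y_i = w^T x_i); the function name and defaults (d = 10, n = 20, matching the paper's experiment setup) are our own, not from the paper's code, which is not released.

```python
import numpy as np

def sample_task(d=10, n=20, rng=None):
    """Sketch of the paper's synthetic in-context linear-regression task.

    Draws a ground-truth weight vector w ~ N(0, I_d) and n covariates
    x_i ~ N(0, I_d), then sets y_i = <w, x_i> (noiseless labels).
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    w = rng.standard_normal(d)        # task vector, one per prompt
    X = rng.standard_normal((n, d))   # in-context examples, rows are x_i
    y = X @ w                         # y_i = w^T x_i for all i in [n]
    return w, X, y
```

Each call produces one in-context prompt; a fresh task vector w is drawn per prompt, which is consistent with the report's note that data appears to be generated per experiment or batch rather than split into fixed train/test sets.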
Dataset Splits No The paper uses synthetic data and describes parameters for its generation (e.g., 'token dimensions d = 10, number of in-context examples n = 20') and training (e.g., 'batch size B = 1000'), but it does not specify explicit training/test/validation dataset splits in the conventional sense, as data appears to be generated per experiment or batch.
Hardware Specification Yes For all our experiments, we use PyTorch (Paszke et al., 2019) and models are trained on an NVIDIA RTX A6000.
Software Dependencies No The paper mentions 'pytorch Paszke et al. (2019)' but does not provide a specific version number for PyTorch or for any other software dependency.
Experiment Setup Yes In particular, we choose the token dimensions d = 10, number of in-context examples n = 20, and GD learning rate η = 0.4 for generating the ground-truth intermediate states. We use a batch size B = 1000 and run Adam with learning rate α = 0.001 for τ = 750 iterations.
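The quoted setup says the ground-truth intermediate chain-of-thought states are generated by gradient descent with learning rate η = 0.4. A minimal sketch of that step, assuming GD on the standard least-squares loss (1/2n)||Xw − y||² starting from w⁰ = 0 (the precise loss normalization and initialization are our assumptions, not stated in this report):

```python
import numpy as np

def gd_intermediate_states(X, y, k, eta=0.4):
    """Sketch: k steps of gradient descent on least squares,
    returning the iterates w^0, w^1, ..., w^k that would serve as
    ground-truth intermediate states for CoT supervision.

    Assumes loss L(w) = (1/2n) * ||X w - y||^2 and w^0 = 0.
    """
    n, d = X.shape
    w = np.zeros(d)
    states = [w.copy()]
    for _ in range(k):
        grad = X.T @ (X @ w - y) / n  # gradient of the assumed loss
        w = w - eta * grad            # GD step with eta = 0.4 as quoted
        states.append(w.copy())
    return states
```

With the paper's d = 10, n = 20, step size 0.4 is well inside the stable range for this loss, so the iterates steadily reduce the least-squares error; the transformer's CoT tokens are then trained to match these iterates.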