Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Transformers Learn to Implement Multi-step Gradient Descent with Chain of Thought
Authors: Jianhao Huang, Zixuan Wang, Jason Lee
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate that CoT prompting yields substantial performance improvements. Section 5 empirically validates the advantage of CoT. We empirically validate that the trained transformer converges to the minimizer predicted by our theory, with a distinct performance gap between models trained with and without CoT prompting. Our experiments show that the weights of the full model exhibit a structure consistent with Theorem 3.2. Our experiments in Figure 2 demonstrate that the evaluation loss of transformers with CoT converges to near zero even when k = 10. We empirically verify the OOD generalization result shown by Theorem 4.2. Our experiment in Figure 3 shows that the OOD loss of transformers with CoT converges to near zero when k = 10, 20, 30, 40 as the training loss/in-distribution loss converges to zero. |
| Researcher Affiliation | Academia | ¹Shanghai Jiaotong University, ²Princeton University; huang EMAIL, EMAIL |
| Pseudocode | No | The paper describes the model architecture and processes using mathematical formulas and natural language. There are no explicitly labeled "Pseudocode" or "Algorithm" blocks, nor structured steps formatted like code. |
| Open Source Code | No | The paper does not explicitly state that source code is provided, nor does it include any links to code repositories or mention code in supplementary materials. |
| Open Datasets | No | The paper uses synthetic data generated according to the process described in Equation (1): 'w ∼ N(0, I_d), x_i ∼ N(0, I_d), y_i = w⊤x_i for all i ∈ [n]'. While the generation process is fully described, there is no external link, DOI, or specific repository for a pre-existing dataset. |
| Dataset Splits | No | The paper uses synthetic data and describes parameters for its generation (e.g., 'token dimensions d = 10, number of in-context examples n = 20') and training (e.g., 'batch size B = 1000'), but it does not specify explicit training/test/validation dataset splits in the conventional sense, as data appears to be generated per experiment or batch. |
| Hardware Specification | Yes | For all our experiments, we use pytorch Paszke et al. (2019) and models are trained on an NVIDIA RTX A6000. |
| Software Dependencies | No | The paper mentions 'pytorch Paszke et al. (2019)' but does not provide a specific version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | In particular, we choose the token dimensions d = 10, number of in-context examples n = 20, and GD learning rate η = 0.4 for generating the ground-truth intermediate states. We use a batch size B = 1000 and run Adam with learning rate α = 0.001 for τ = 750 iterations. |
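Since the dataset is fully specified by the generation process in Equation (1) and the reported hyperparameters (d = 10, n = 20, η = 0.4), the training data can in principle be regenerated. The sketch below illustrates one plausible reading of that process: drawing a linear-regression task and producing ground-truth GD intermediate states with learning rate η = 0.4. The function names (`make_task`, `gd_targets`), the least-squares loss normalization (1/2n), and the zero initialization are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 10, 20   # token dimension and number of in-context examples (Section: Experiment Setup)
eta = 0.4       # GD learning rate for the ground-truth intermediate states


def make_task(rng, d=10, n=20):
    """Draw one synthetic in-context task per Equation (1):
    w ~ N(0, I_d), x_i ~ N(0, I_d), y_i = w^T x_i (noiseless)."""
    w = rng.standard_normal(d)         # ground-truth weight vector
    X = rng.standard_normal((n, d))    # in-context inputs, one row per example
    y = X @ w                          # labels y_i = w^T x_i
    return w, X, y


def gd_targets(X, y, eta=0.4, k=10):
    """Assumed construction of the intermediate states: k steps of gradient
    descent on L(w) = (1/2n) * ||Xw - y||^2, starting from w_0 = 0."""
    n = X.shape[0]
    w_hat = np.zeros(X.shape[1])
    states = [w_hat.copy()]
    for _ in range(k):
        grad = X.T @ (X @ w_hat - y) / n   # gradient of the least-squares loss
        w_hat = w_hat - eta * grad
        states.append(w_hat.copy())
    return states


w, X, y = make_task(rng, d, n)
states = gd_targets(X, y, eta, k=10)
```

Under this reading, `states` would serve as the supervision for the chain-of-thought tokens; the transformer itself would then be trained on such tasks with batch size B = 1000 and Adam (α = 0.001) for τ = 750 iterations, as quoted above.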