In-context Convergence of Transformers
Authors: Yu Huang, Yuan Cheng, Yingbin Liang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): In this section, we conduct experiments to demonstrate that our theoretical results are consistent with the actual dynamics during the in-context training of transformers. Detailed experimental settings are deferred to Appendix B. ... Task and Data Generations. We follow the task and data distributions introduced in Section 2.1. ... Stage-Wise Convergence. In Figure 2, we plot the evolution of the prediction error for each feature throughout the training process. ... Attention Score Concentration. In Figure 3, we present the dynamic evolution of attention scores throughout the training process for both balanced and imbalanced scenarios. |
| Researcher Affiliation | Academia | (1) Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, PA, USA; (2) National University of Singapore, Singapore; (3) Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA. |
| Pseudocode | No | The training algorithm is described textually with the update formula θ^(t+1) = θ^(t) − η∇_θ L(θ^(t)), but no formal pseudocode or algorithm block is provided. |
| Open Source Code | No | The paper does not provide any statement about releasing open-source code or a link to a code repository. |
| Open Datasets | No | We follow the task and data distributions introduced in Section 2.1. For each task, we sample the task weight w from N(0, I_{d×d}). Each data point is drawn from the given feature set {v_k ∈ R^d, k = 1, ..., K} with probability p_k for sampling v_k... The paper describes synthetically generated data for its experiments rather than the use of a publicly available dataset. |
| Dataset Splits | No | The paper states: "We collect M = 300 randomly generated prompts and then train the model based on the empirical version of the training objective Equation (4) for 400 epochs". While it mentions training, it does not specify explicit training, validation, or test dataset splits or percentages. |
| Hardware Specification | No | The paper does not mention any specific hardware used for running the experiments (e.g., GPU models, CPU models, memory). |
| Software Dependencies | No | We collect M = 300 randomly generated prompts and then train the model based on the empirical version of the training objective Equation (4) for 400 epochs using Adam (Kingma & Ba, 2014) with full batch and the learning rate of 0.002. Adam is an optimizer, but no specific software versions (e.g., Python, PyTorch) are listed. |
| Experiment Setup | Yes | We collect M = 300 randomly generated prompts and then train the model based on the empirical version of the training objective Equation (4) for 400 epochs using Adam (Kingma & Ba, 2014) with full batch and the learning rate of 0.002. A hedged sketch of this setup is given below the table. |
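
To make the quoted setup concrete, here is a minimal PyTorch sketch of the described data generation and full-batch Adam training. Only M = 300 prompts, 400 epochs, full-batch Adam, and the 0.002 learning rate come from the quoted text; the dimensions d and K, the prompt length N, the orthonormal-feature construction, the one-layer softmax-attention model, and the squared-error objective are illustrative assumptions standing in for the paper's Section 2.1 and Equation (4), not the authors' implementation.

```python
import torch

# Hypothetical sizes -- the quoted text fixes only M, the epoch count, and the
# learning rate; d, K, and the prompt length N are placeholders.
d, K, N, M = 16, 4, 32, 300

# Orthonormal feature set {v_1, ..., v_K} (here: the first K standard basis vectors).
V = torch.eye(d)[:K]                      # (K, d)
p = torch.full((K,), 1.0 / K)             # sampling probabilities p_k (balanced case)

# Generate M prompts: each token x_n is some v_k, label y_n = <w, x_n>, w ~ N(0, I).
idx = torch.multinomial(p, M * (N + 1), replacement=True).view(M, N + 1)
X = V[idx]                                # (M, N+1, d); the last position is the query
w = torch.randn(M, d)                     # one task vector per prompt
y = torch.einsum('mnd,md->mn', X, w)      # labels, including the query label

# A minimal one-layer softmax-attention predictor (a stand-in for the paper's model).
class OneLayerAttention(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.Q = torch.nn.Parameter(torch.zeros(dim, dim))   # query/key matrix

    def forward(self, X, y):
        # Attend from the query token to the N context tokens, then average their labels.
        ctx_x, ctx_y, query = X[:, :-1], y[:, :-1], X[:, -1]
        scores = torch.einsum('md,de,mne->mn', query, self.Q, ctx_x)
        attn = torch.softmax(scores, dim=-1)
        return (attn * ctx_y).sum(dim=-1)                     # predicted query label

model = OneLayerAttention(d)
opt = torch.optim.Adam(model.parameters(), lr=2e-3)

# Full-batch training for 400 epochs on the squared prediction error
# (a stand-in for the paper's empirical objective in Equation (4)).
for epoch in range(400):
    opt.zero_grad()
    loss = ((model(X, y) - y[:, -1]) ** 2).mean()
    loss.backward()
    opt.step()
```

A reproduction would still need to fill in the actual model parameterization, data dimensions, and objective from Section 2.1 and Appendix B of the paper; the sketch only illustrates the training loop implied by the quoted settings.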