In-context Convergence of Transformers

Authors: Yu Huang, Yuan Cheng, Yingbin Liang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "5. Experiments. In this section, we conduct experiments to demonstrate that our theoretical results are consistent with the actual dynamics during the in-context training of transformers. Detailed experimental settings are deferred to Appendix B. ... Task and Data Generations. We follow the task and data distributions introduced in Section 2.1. ... Stage-Wise Convergence. In Figure 2, we plot the evolution of the prediction error for each feature throughout the training process. ... Attention Score Concentration. In Figure 3, we present the dynamic evolution of attention scores throughout the training process for both balanced and imbalanced scenarios."
Researcher Affiliation | Academia | "1 Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, PA, USA; 2 National University of Singapore, Singapore; 3 Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA."
Pseudocode | No | The training algorithm is described textually with the gradient-descent update θ(t+1) = θ(t) − η ∇θ L(θ(t)), but no formal pseudocode or algorithm block is provided.
Open Source Code | No | The paper does not provide any statement about releasing open-source code or a link to a code repository.
Open Datasets | No | "We follow the task and data distributions introduced in Section 2.1. For each task, we sample the task weight w from N(0, I_{d×d}). Each data point is drawn from the given feature set {v_k ∈ R^d, k = 1, ..., K} with probability p_k for sampling v_k." The paper describes how data is generated for the experiments, not that a publicly available dataset is used (a data-generation sketch based on this description follows the table).
Dataset Splits | No | The paper states: "We collect M = 300 randomly generated prompts and then train the model based on the empirical version of the training objective Equation (4) for 400 epochs." While it mentions training, it does not specify explicit training, validation, or test splits or percentages.
Hardware Specification | No | The paper does not mention any specific hardware used for running the experiments (e.g., GPU models, CPU models, memory).
Software Dependencies | No | "We collect M = 300 randomly generated prompts and then train the model based on the empirical version of the training objective Equation (4) for 400 epochs using Adam (Kingma & Ba, 2014) with full batch and the learning rate of 0.002." Adam is an optimizer, but no specific software versions (e.g., Python, PyTorch) are listed.
Experiment Setup | Yes | "We collect M = 300 randomly generated prompts and then train the model based on the empirical version of the training objective Equation (4) for 400 epochs using Adam (Kingma & Ba, 2014) with full batch and the learning rate of 0.002." (A training-loop sketch matching this quoted setup follows the table.)
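For concreteness, here is a minimal Python/NumPy sketch of the prompt-generation procedure quoted under Open Datasets. The concrete sizes, the unit-norm feature vectors, and the linear label rule y_i = <w, x_i> are illustrative assumptions based on the quoted description, not code released with the paper.

```python
# Hypothetical data-generation sketch based on the quoted description:
# sample a task weight w from a Gaussian, then draw each data point from the
# feature set {v_k} with probability p_k. Sizes and the linear label rule
# y_i = <w, x_i> are assumptions for illustration only.
import numpy as np

def generate_prompt(features, probs, n_context, rng):
    """Return one prompt: context pairs (x_i, y_i) plus a held-out query pair."""
    d = features.shape[1]
    w = rng.standard_normal(d)                         # task weight w drawn from a Gaussian
    idx = rng.choice(len(features), size=n_context + 1, p=probs)
    x = features[idx]                                  # every token is one of the K features
    y = x @ w                                          # assumed linear labels y_i = <w, x_i>
    return x[:-1], y[:-1], x[-1], y[-1]                # context inputs/labels, query input, query label

rng = np.random.default_rng(0)
K, d, n_context, M = 4, 16, 64, 300                    # M = 300 prompts as quoted; other sizes illustrative
features = rng.standard_normal((K, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)  # unit-norm features (assumption)
probs = np.full(K, 1.0 / K)                            # balanced scenario; skew p_k for the imbalanced one
prompts = [generate_prompt(features, probs, n_context, rng) for _ in range(M)]
```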
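And a matching training-loop sketch for the setup quoted under Experiment Setup (M = 300 prompts, 400 epochs, full-batch Adam, learning rate 0.002). PyTorch is used here purely for illustration, since the paper does not list its software stack, and the one-layer softmax-attention predictor and squared loss are simplified placeholders rather than the paper's exact model and training objective (Equation (4)).

```python
# Hypothetical training-loop sketch for the quoted setup: M = 300 prompts,
# 400 epochs, full-batch Adam, learning rate 0.002. The model and loss are
# simplified stand-ins; the paper's parameterization is not reproduced here.
import torch

class OneLayerSoftmaxAttention(torch.nn.Module):
    """Placeholder predictor: the query attends to context tokens and
    outputs an attention-weighted average of the context labels."""
    def __init__(self, d):
        super().__init__()
        self.W = torch.nn.Parameter(torch.zeros(d, d))  # trainable attention weights

    def forward(self, X, y, x_query):
        # X: (M, N, d) context inputs, y: (M, N) context labels, x_query: (M, d) queries
        logits = torch.einsum('id,de,ine->in', x_query, self.W, X)
        attn = torch.softmax(logits, dim=-1)            # attention scores over context positions
        return (attn * y).sum(dim=-1)                   # predicted label for each query

def train(model, X, y, x_query, y_query, epochs=400, lr=2e-3):
    # Full batch: every step uses all M prompts, matching "Adam with full batch".
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((model(X, y, x_query) - y_query) ** 2).mean()  # empirical squared-error objective
        loss.backward()
        opt.step()
    return model
```

The inputs X, y, x_query, y_query would be the M stacked prompts from the previous sketch, converted to torch tensors. Note that the quoted analysis is stated for the plain gradient step θ(t+1) = θ(t) − η ∇θ L(θ(t)); the Adam optimizer above mirrors the experimental setup quoted in the table rather than the analyzed update.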