How Do Nonlinear Transformers Learn and Generalize in In-Context Learning?
Authors: Hongkang Li, Meng Wang, Songtao Lu, Xiaodong Cui, Pin-Yu Chen
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To the best of our knowledge, this paper provides the first theoretical analysis of the training dynamics of Transformers with nonlinear self-attention and nonlinear MLP, together with the ICL generalization capability of the resulting model. Focusing on a group of binary classification tasks, we train Transformers using data from a subset of these tasks and quantify the impact of various factors on the ICL generalization performance on the remaining unseen tasks with and without data distribution shifts. We also analyze how different components in the learned Transformers contribute to the ICL performance. Furthermore, we provide the first theoretical analysis of how model pruning affects ICL performance and prove that proper magnitude-based pruning can have a minimal impact on ICL while reducing inference costs. These theoretical findings are justified through numerical experiments. |
| Researcher Affiliation | Collaboration | 1 Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY, USA; 2 IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA. Correspondence to: Hongkang Li <lih35@rpi.edu>, Meng Wang <wangm7@rpi.edu>, Songtao Lu <songtao@ibm.com>, Xiaodong Cui <cuix@us.ibm.com>, Pin-Yu Chen <pinyu.chen@ibm.com>. |
| Pseudocode | Yes | The model is trained using stochastic gradient descent (SGD) with step size η and batch size B, summarized in Algorithm 1 in Appendix C. ... Algorithm 1 Training with Stochastic Gradient Descent (SGD) (a minimal training-loop sketch appears after this table) |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or provide a link to a code repository. |
| Open Datasets | No | Data Generation: We verify our theoretical findings using data generated as described in Section 2. ... The in-context binary classification error is evaluated by E_(x,y)[Pr(y · F(Ψ; P) < 0)] for x following either D or D′, with P constructed in (1). ... This synthetic dataset, while based on the paper's formulation, is not provided with public access information. (A Monte Carlo sketch of this error metric appears after the table.) |
| Dataset Splits | No | The paper mentions "training data" and "testing queries" but does not specify explicit training/validation/test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not list specific software components with their version numbers (e.g., Python 3.x, PyTorch 1.x) to describe ancillary software dependencies. |
| Experiment Setup | Yes | Model and Training Setup: ... If not otherwise specified, we set α = 0.8 and l_tr = 20 for training. ... For the one-layer Transformer, we use U = 1 and m_a = m_b = 60. ... Algorithm 1 Training with Stochastic Gradient Descent (SGD) ... Hyperparameters: the step size η, the number of iterations T, the batch size B. ... Initialization: Each entry of W_O^(0) and a^(0) is drawn from N(0, ξ²) and Uniform({+1/√m, −1/√m}), respectively. W_Q, W_K, and W_V are initialized such that all diagonal entries of W_V^(0), and the first d_X diagonal entries of W_Q^(0) and W_K^(0), are set to δ with δ ∈ (0, 0.2]. (An initialization sketch appears after the table.) |
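
The SGD procedure quoted in the Pseudocode row (Algorithm 1) is ordinary mini-batch training over in-context prompts. Below is a minimal sketch, assuming a stand-in one-layer model with softmax self-attention and a ReLU readout plus a toy linear-rule prompt generator; `OneLayerTransformer`, `make_prompt_batch`, the dimensions, and the logistic surrogate loss are illustrative placeholders, not the paper's exact F(Ψ; P), data distribution, or loss.

```python
# Mini-batch SGD sketch in the style of the quoted Algorithm 1: T iterations,
# step size eta, batch size B, prompts with l_tr context examples per task.
# The model and data generator are illustrative stand-ins, not the paper's
# exact construction.
import torch

torch.manual_seed(0)
d, l_tr, m = 10, 20, 60          # feature dim, context length, readout width
eta, B, T = 0.1, 64, 500         # step size, batch size, iterations


class OneLayerTransformer(torch.nn.Module):
    def __init__(self, d, m):
        super().__init__()
        self.WQ = torch.nn.Linear(d + 1, d + 1, bias=False)
        self.WK = torch.nn.Linear(d + 1, d + 1, bias=False)
        self.WV = torch.nn.Linear(d + 1, d + 1, bias=False)
        self.WO = torch.nn.Linear(d + 1, m, bias=False)
        self.a = torch.nn.Parameter(torch.randn(m) / m ** 0.5)

    def forward(self, P):
        # P: (B, l+1, d+1); the last token is the query with its label slot zeroed.
        q = self.WQ(P[:, -1:, :])                                   # (B, 1, d+1)
        scores = q @ self.WK(P).transpose(1, 2) / (P.shape[-1] ** 0.5)
        attn = torch.softmax(scores, dim=-1)                        # (B, 1, l+1)
        ctx = attn @ self.WV(P)                                     # (B, 1, d+1)
        return torch.relu(self.WO(ctx)).squeeze(1) @ self.a         # (B,) logits


def make_prompt_batch(B, l, d):
    """Toy binary-classification prompts: labels come from one random linear rule."""
    w = torch.randn(d)
    x = torch.randn(B, l + 1, d)
    y = torch.where(x @ w >= 0, 1.0, -1.0)                          # (B, l+1) labels
    P = torch.cat([x, y.unsqueeze(-1)], dim=-1)                     # append label channel
    P[:, -1, -1] = 0.0                                              # hide the query label
    return P, y[:, -1]


model = OneLayerTransformer(d, m)
opt = torch.optim.SGD(model.parameters(), lr=eta)
for t in range(T):
    P, y = make_prompt_batch(B, l_tr, d)
    loss = torch.nn.functional.soft_margin_loss(model(P), y)        # logistic surrogate
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Each batch here draws all prompts from a single random task; the paper instead samples tasks from a designated training subset, which this toy generator does not attempt to model.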
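The quoted evaluation metric, E_(x,y)[Pr(y · F(Ψ; P) < 0)], can be estimated by Monte Carlo over freshly sampled prompts. The sketch below reuses the hypothetical `model` and `make_prompt_batch` from the training sketch above; evaluating under a shifted distribution D′ would only require swapping in a different prompt generator.

```python
# Monte Carlo estimate of the in-context binary classification error
#   E_{(x, y)}[ Pr( y * F(Psi; P) < 0 ) ]
# over fresh prompts from the toy generator used in the training sketch.
import torch


@torch.no_grad()
def icl_error(model, prompt_fn, n_eval=10_000, l=20, d=10, batch=500):
    """Fraction of query tokens whose label disagrees with the sign of the logit."""
    wrong, total = 0, 0
    while total < n_eval:
        P, y = prompt_fn(batch, l, d)
        wrong += (y * model(P) < 0).sum().item()   # misclassified queries
        total += batch
    return wrong / total


# In-distribution error; a shifted distribution D' would use a different prompt_fn.
print("in-context classification error:", icl_error(model, make_prompt_batch))
```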
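The quoted initialization can likewise be sketched against the same stand-in model. The √m scaling of a^(0), the ξ and δ values, and the zeroing of off-diagonal entries are assumptions made for illustration; the paper's exact construction is given alongside Algorithm 1 and is not reproduced here.

```python
# Sketch of the quoted initialization, applied to the stand-in model:
#  - entries of W_O^(0) ~ N(0, xi^2)
#  - entries of a^(0)   = random signs at scale 1/sqrt(m)  (scale is illustrative)
#  - all diagonal entries of W_V^(0), and the first d_X diagonal entries of
#    W_Q^(0) and W_K^(0), set to delta in (0, 0.2]; off-diagonals zeroed here
#    as a simplification.
import torch


def init_transformer(model, d_x, xi=0.01, delta=0.2):
    with torch.no_grad():
        m = model.a.numel()
        model.WO.weight.normal_(0.0, xi)                            # W_O^(0) ~ N(0, xi^2)
        signs = torch.randint(0, 2, (m,)).float() * 2.0 - 1.0       # random +/- 1
        model.a.copy_(signs / m ** 0.5)                             # a^(0) at scale 1/sqrt(m)
        for W, k in ((model.WV, model.WV.weight.shape[0]),          # all diagonal entries of W_V
                     (model.WQ, d_x), (model.WK, d_x)):             # first d_X entries of W_Q, W_K
            W.weight.zero_()                                        # simplification: off-diagonals set to zero
            W.weight[:k, :k] += delta * torch.eye(k)                # diagonal block set to delta
    return model


model = init_transformer(OneLayerTransformer(d, m), d_x=d)          # initialize before training
```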