How Do Nonlinear Transformers Learn and Generalize in In-Context Learning?
Authors: Hongkang Li, Meng Wang, Songtao Lu, Xiaodong Cui, Pin-Yu Chen
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To the best of our knowledge, this paper provides the first theoretical analysis of the training dynamics of Transformers with nonlinear self-attention and nonlinear MLP, together with the ICL generalization capability of the resulting model. Focusing on a group of binary classification tasks, we train Transformers using data from a subset of these tasks and quantify the impact of various factors on the ICL generalization performance on the remaining unseen tasks with and without data distribution shifts. We also analyze how different components in the learned Transformers contribute to the ICL performance. Furthermore, we provide the first theoretical analysis of how model pruning affects ICL performance and prove that proper magnitude-based pruning can have a minimal impact on ICL while reducing inference costs. These theoretical findings are justified through numerical experiments. |
| Researcher Affiliation | Collaboration | 1 Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY, USA; 2 IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA. Correspondence to: Hongkang Li <lih35@rpi.edu>, Meng Wang <wangm7@rpi.edu>, Songtao Lu <songtao@ibm.com>, Xiaodong Cui <cuix@us.ibm.com>, Pin-Yu Chen <pinyu.chen@ibm.com>. |
| Pseudocode | Yes | The model is trained using stochastic gradient descent (SGD) with step size η and batch size B, summarized in Algorithm 1 in Appendix C. ... Algorithm 1 Training with Stochastic Gradient Descent (SGD) (a minimal training-loop sketch appears after this table) |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or provide a link to a code repository. |
| Open Datasets | No | Data Generation: We verify our theoretical findings using data generated as described in Section 2. ... The in-context binary classification error is evaluated by E_(x,y)[Pr(y · F(Ψ; P) < 0)] for x following either D or D′, with P constructed in (1). ... This synthetic dataset, while based on the paper's formulation, is not provided with public access information. (A Monte Carlo sketch of this error metric appears after the table.) |
| Dataset Splits | No | The paper mentions "training data" and "testing queries" but does not specify explicit training/validation/test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not list specific software components with their version numbers (e.g., Python 3.x, PyTorch 1.x) to describe ancillary software dependencies. |
| Experiment Setup | Yes | Model and Training Setup: ... If not otherwise specified, we set α = 0.8 and l_tr = 20 for training. ... For the one-layer Transformer, we use U = 1 and m_a = m_b = 60. ... Algorithm 1 Training with Stochastic Gradient Descent (SGD) ... Hyperparameters: the step size η, the number of iterations T, the batch size B. ... Initialization: Each entry of W_O^(0) and a^(0) is drawn from N(0, ξ²) and Uniform({+1/√m, −1/√m}), respectively. W_Q, W_K, and W_V are initialized such that all diagonal entries of W_V^(0), and the first d_X diagonal entries of W_Q^(0) and W_K^(0), are set to δ with δ ∈ (0, 0.2]. (An initialization sketch appears after the table.) |
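
The SGD procedure quoted in the Pseudocode row (Algorithm 1) is ordinary mini-batch training over in-context prompts. Below is a minimal sketch, assuming a stand-in one-layer model with softmax self-attention and a ReLU readout plus a toy linear-rule prompt generator; `OneLayerTransformer`, `make_prompt_batch`, the dimensions, and the logistic surrogate loss are illustrative placeholders, not the paper's exact F(Ψ; P), data distribution, or loss.

```python
# Mini-batch SGD sketch in the style of the quoted Algorithm 1: T iterations,
# step size eta, batch size B, prompts with l_tr context examples per task.
# The model and data generator are illustrative stand-ins, not the paper's
# exact construction.
import torch

torch.manual_seed(0)
d, l_tr, m = 10, 20, 60          # feature dim, context length, readout width
eta, B, T = 0.1, 64, 500         # step size, batch size, iterations


class OneLayerTransformer(torch.nn.Module):
    def __init__(self, d, m):
        super().__init__()
        self.WQ = torch.nn.Linear(d + 1, d + 1, bias=False)
        self.WK = torch.nn.Linear(d + 1, d + 1, bias=False)
        self.WV = torch.nn.Linear(d + 1, d + 1, bias=False)
        self.WO = torch.nn.Linear(d + 1, m, bias=False)
        self.a = torch.nn.Parameter(torch.randn(m) / m ** 0.5)

    def forward(self, P):
        # P: (B, l+1, d+1); the last token is the query with its label slot zeroed.
        q = self.WQ(P[:, -1:, :])                                   # (B, 1, d+1)
        scores = q @ self.WK(P).transpose(1, 2) / (P.shape[-1] ** 0.5)
        attn = torch.softmax(scores, dim=-1)                        # (B, 1, l+1)
        ctx = attn @ self.WV(P)                                     # (B, 1, d+1)
        return torch.relu(self.WO(ctx)).squeeze(1) @ self.a         # (B,) logits


def make_prompt_batch(B, l, d):
    """Toy binary-classification prompts: labels come from one random linear rule."""
    w = torch.randn(d)
    x = torch.randn(B, l + 1, d)
    y = torch.where(x @ w >= 0, 1.0, -1.0)                          # (B, l+1) labels
    P = torch.cat([x, y.unsqueeze(-1)], dim=-1)                     # append label channel
    P[:, -1, -1] = 0.0                                              # hide the query label
    return P, y[:, -1]


model = OneLayerTransformer(d, m)
opt = torch.optim.SGD(model.parameters(), lr=eta)
for t in range(T):
    P, y = make_prompt_batch(B, l_tr, d)
    loss = torch.nn.functional.soft_margin_loss(model(P), y)        # logistic surrogate
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Each batch here draws all prompts from a single random task; the paper instead samples tasks from a designated training subset, which this toy generator does not attempt to model.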
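The quoted evaluation metric, E_(x,y)[Pr(y · F(Ψ; P) < 0)], can be estimated by Monte Carlo over freshly sampled prompts. The sketch below reuses the hypothetical `model` and `make_prompt_batch` from the training sketch above; evaluating under a shifted distribution D′ would only require swapping in a different prompt generator.

```python
# Monte Carlo estimate of the in-context binary classification error
#   E_{(x, y)}[ Pr( y * F(Psi; P) < 0 ) ]
# over fresh prompts from the toy generator used in the training sketch.
import torch


@torch.no_grad()
def icl_error(model, prompt_fn, n_eval=10_000, l=20, d=10, batch=500):
    """Fraction of query tokens whose label disagrees with the sign of the logit."""
    wrong, total = 0, 0
    while total < n_eval:
        P, y = prompt_fn(batch, l, d)
        wrong += (y * model(P) < 0).sum().item()   # misclassified queries
        total += batch
    return wrong / total


# In-distribution error; a shifted distribution D' would use a different prompt_fn.
print("in-context classification error:", icl_error(model, make_prompt_batch))
```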
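The quoted initialization can likewise be sketched against the same stand-in model. The √m scaling of a^(0), the ξ and δ values, and the zeroing of off-diagonal entries are assumptions made for illustration; the paper's exact construction is given alongside Algorithm 1 and is not reproduced here.

```python
# Sketch of the quoted initialization, applied to the stand-in model:
#  - entries of W_O^(0) ~ N(0, xi^2)
#  - entries of a^(0)   = random signs at scale 1/sqrt(m)  (scale is illustrative)
#  - all diagonal entries of W_V^(0), and the first d_X diagonal entries of
#    W_Q^(0) and W_K^(0), set to delta in (0, 0.2]; off-diagonals zeroed here
#    as a simplification.
import torch


def init_transformer(model, d_x, xi=0.01, delta=0.2):
    with torch.no_grad():
        m = model.a.numel()
        model.WO.weight.normal_(0.0, xi)                            # W_O^(0) ~ N(0, xi^2)
        signs = torch.randint(0, 2, (m,)).float() * 2.0 - 1.0       # random +/- 1
        model.a.copy_(signs / m ** 0.5)                             # a^(0) at scale 1/sqrt(m)
        for W, k in ((model.WV, model.WV.weight.shape[0]),          # all diagonal entries of W_V
                     (model.WQ, d_x), (model.WK, d_x)):             # first d_X entries of W_Q, W_K
            W.weight.zero_()                                        # simplification: off-diagonals set to zero
            W.weight[:k, :k] += delta * torch.eye(k)                # diagonal block set to delta
    return model


model = init_transformer(OneLayerTransformer(d, m), d_x=d)          # initialize before training
```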