Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers

Authors: Siyu Chen, Heejune Sheen, Tianhao Wang, Zhuoran Yang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model that performs a generalized version of the induction head mechanism with a learned feature, resulting from the congruous contribution of all the building blocks. In the limiting model, the first attention layer acts as a copier, copying past tokens within a given window to each position, and the feed-forward network with normalization acts as a selector that generates a feature vector by only looking at informationally relevant parents from the window. Finally, the second attention layer is a classifier that compares these features with the feature at the output position, and uses the resulting similarity scores to generate the desired output. Our theory is further validated by simulation experiments. (A code sketch of this mechanism appears below the table.)
Researcher Affiliation | Academia | Siyu Chen, Department of Statistics and Data Science, Yale University (siyu.chen.sc3226@yale.edu); Heejune Sheen, Department of Statistics and Data Science, Yale University (heejune.sheen@yale.edu); Tianhao Wang, Toyota Technological Institute at Chicago (tianhao.wang@ttic.edu); Zhuoran Yang, Department of Statistics and Data Science, Yale University (zhuoran.yang@yale.edu)
Pseudocode | No | The paper describes the operations of the transformer model in text and equations but does not present them in a clearly labeled pseudocode or algorithm block.
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We do not release the data and code, but the details provided in Appendix B are sufficient for reproducing the synthetic data and the experiment results.
Open Datasets | No | The dataset for the ICL task is generated as n-gram Markov chains as described in Section 2.1. We randomly sample 10,000 Markov chains with L = 100 from the prior distribution P...
Dataset Splits | Yes | We randomly sample 10,000 Markov chains with L = 100 from the prior distribution P; 9,000 are used for training and 1,000 for validation. (A data-generation sketch appears below the table.)
Hardware Specification | Yes | All experiments are conducted using a single Nvidia A100 GPU.
Software Dependencies | No | The paper mentions training with 'gradient descent' and 'cross-entropy loss' but does not specify software libraries, frameworks (e.g., PyTorch, TensorFlow), or their version numbers.
Experiment Setup | Yes | Model initialization: The RPE weight matrix $W_P^{(h)}$ is initialized such that the $(-i)$-th diagonal of $W_P^{(h)}$ has value $w_i^{(h)}$ for $i = 1, 2, \ldots, M$, while all other entries are initialized to $-\infty$. We initialize $w_h^{(h)} = 3$ and set the remaining entries within the size-$M$ window to 0.01 to ensure symmetry breaking and some initial correspondence between heads and parents. For the FFN layer that learns the polynomial features, all $c_S$ for $S \subseteq [H]$ with $|S| \le D$ are initialized to 0.01. The initial value of $a$ in the second attention layer is set to 0.01. Training settings: The models are trained using gradient descent with respect to the cross-entropy loss and a constant learning rate that is set to one for all stages. We train the model in Stage I (updating only the parameters $\{c_S\}$) for 2,000 epochs, in Stage II (updating only $\{w^{(h)}\}$) for 50,000 epochs, and in Stage III (updating only $a$) for 5,000 epochs. (A sketch of this staged training schedule appears below the table.)
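
To make the three-part mechanism quoted under Research Type concrete, the following is a minimal NumPy sketch of a generalized induction head: a copier that gathers the previous tokens within a window, a selector that keeps only the relevant parents and normalizes the resulting feature, and a classifier that matches the query-position feature against earlier positions. The window size, the choice of relevant parents, the tensor-product feature, and the inverse temperature a are illustrative assumptions, not the paper's exact parameterization.

```python
# Minimal sketch of a generalized induction head (copier -> selector -> classifier).
# Shapes, the feature map, and the window size are assumptions for illustration.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def generalized_induction_head(tokens, vocab_size, window=3, relevant=(1, 2), a=10.0):
    """tokens: 1-D int array of length L; returns a distribution over the next token."""
    L = len(tokens)
    X = np.eye(vocab_size)[tokens]                      # one-hot embeddings, shape (L, V)

    # 1) Copier: the first attention layer copies the previous `window` tokens
    #    (the candidate parents) to each position, zero-padding at the sequence start.
    parents = np.zeros((L, window, vocab_size))
    for t in range(L):
        for lag in range(1, window + 1):
            if t - lag >= 0:
                parents[t, lag - 1] = X[t - lag]

    # 2) Selector: keep only the informationally relevant parents and combine them
    #    into one feature (a tensor product, so two features have inner product 1
    #    exactly when all relevant parents match), then normalize.
    feat = np.ones((L, 1))
    for lag in relevant:
        feat = np.einsum('ti,tj->tij', feat, parents[:, lag - 1]).reshape(L, -1)
    feat = feat / (np.linalg.norm(feat, axis=-1, keepdims=True) + 1e-8)

    # 3) Classifier: compare the feature at the output position with features at
    #    earlier positions; each earlier position votes for the token that followed
    #    it, weighted by softmax similarity.
    scores = a * (feat[:-1] @ feat[-1])
    return softmax(scores) @ X[1:]
```

On a sequence whose relevant parents at the final position repeat a context seen earlier, the returned distribution concentrates on the token that followed that earlier context, which is the induction-head behavior described in the quoted abstract.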
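
The data described under Open Datasets and Dataset Splits can be reproduced along the lines below: sample 10,000 n-gram Markov chains of length L = 100 from a prior over transition kernels and split them 9,000 / 1,000 into training and validation. The Dirichlet prior, the vocabulary size, and the parent lags are assumptions, since the report quotes the prior only as P.

```python
# Sketch of the synthetic ICL data pipeline: n-gram Markov chains sampled from a
# prior over transition kernels. Prior, vocabulary size, and parent lags are assumed.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, L, N_CHAINS = 3, 100, 10_000
PARENTS = (1, 2)                                  # lags that determine the next token (assumed)

def sample_chain(rng):
    # One categorical distribution per configuration of the parent tokens.
    kernel = rng.dirichlet(np.ones(VOCAB), size=(VOCAB,) * len(PARENTS))
    seq = list(rng.integers(VOCAB, size=max(PARENTS)))    # random initial context
    while len(seq) < L:
        ctx = tuple(seq[-lag] for lag in PARENTS)
        seq.append(rng.choice(VOCAB, p=kernel[ctx]))
    return np.array(seq)

chains = np.stack([sample_chain(rng) for _ in range(N_CHAINS)])
train, val = chains[:9_000], chains[9_000:]        # 9,000 / 1,000 split as reported
```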
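
The staged schedule under Experiment Setup translates into the training skeleton below. It is written in PyTorch purely as an assumption (the Software Dependencies row notes that the paper names no framework), and the attributes c_params, w_params, and a, standing in for {c_S}, {w^(h)}, and a, are hypothetical names for the model's parameter groups.

```python
# Skeleton of the three-stage training schedule: full-batch gradient descent on the
# cross-entropy ICL loss with constant learning rate 1, updating one parameter
# group per stage. The model class and its attribute names are hypothetical.
import torch

def train_in_stages(model, prompts, targets, lr=1.0):
    """prompts/targets: the full training set; each stage runs full-batch gradient descent."""
    stages = [
        ("I",   list(model.c_params), 2_000),      # Stage I:   update {c_S} only
        ("II",  list(model.w_params), 50_000),     # Stage II:  update {w^(h)} only
        ("III", [model.a],            5_000),      # Stage III: update a only
    ]
    loss_fn = torch.nn.CrossEntropyLoss()
    for name, params, epochs in stages:
        opt = torch.optim.SGD(params, lr=lr)       # plain gradient descent, constant lr
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(model(prompts), targets)
            loss.backward()
            opt.step()
```

Because each stage's optimizer receives only its own parameter group, the other parameters stay fixed during that stage, matching the "update parameters {c_S} only / {w^(h)} only / a only" description in the table.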