Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers
Authors: Siyu Chen, Heejune Sheen, Tianhao Wang, Zhuoran Yang
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model that performs a generalized version of the induction head mechanism with a learned feature, resulting from the congruous contribution of all the building blocks. In the limiting model, the first attention layer acts as a copier, copying past tokens within a given window to each position, and the feed-forward network with normalization acts as a selector that generates a feature vector by only looking at informationally relevant parents from the window. Finally, the second attention layer is a classifier that compares these features with the feature at the output position, and uses the resulting similarity scores to generate the desired output. Our theory is further validated by simulation experiments. |
| Researcher Affiliation | Academia | Siyu Chen, Department of Statistics and Data Science, Yale University (EMAIL); Heejune Sheen, Department of Statistics and Data Science, Yale University (EMAIL); Tianhao Wang, Toyota Technological Institute at Chicago (EMAIL); Zhuoran Yang, Department of Statistics and Data Science, Yale University (EMAIL) |
| Pseudocode | No | The paper describes the operations of the transformer model in text and equations but does not present them in a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We do not release the data and code, but the details provided in Appendix B are sufficient for reproducing the synthetic data and the experiment results. |
| Open Datasets | No | The dataset for the ICL task is generated as n-gram Markov chains as described in Section 2.1. We randomly sample 10,000 Markov chains with L = 100 from the prior distribution P... |
| Dataset Splits | Yes | We randomly sample 10,000 Markov chains with L = 100 from the prior distribution P; 9,000 are used for training and 1,000 for validation. |
| Hardware Specification | Yes | All experiments are conducted using a single Nvidia A100 GPU. |
| Software Dependencies | No | The paper mentions training with 'gradient descent' and 'cross-entropy loss' but does not specify software libraries, frameworks (e.g., PyTorch, TensorFlow), or their version numbers. |
| Experiment Setup | Yes | Model initialization. The RPE weight matrix W_P^(h) is initialized such that the (−i)-th diagonal of W_P^(h) has value w_i^(h) for i = 1, 2, ..., M, while all other entries are initialized to −∞. We initialize w_h^(h) = 3 and set the remaining entries within the size-M window to 0.01 to ensure symmetry breaking and some initial correspondence between heads and parents. For the FFN layer that learns the polynomial features, all c_S for S ∈ [H]^{≤D} are initialized to 0.01. The initial value of a in the second attention layer is set to 0.01. Training settings. The models are trained using gradient descent with respect to the cross-entropy loss and a constant learning rate that is set to one for all stages. We train the model in Stage I (updating parameters {c_S} only) for 2,000 epochs, in Stage II (updating parameters {w^(h)} only) for 50,000 epochs, and in Stage III (updating parameter a only) for 5,000 epochs, respectively. |
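
The Research Type row above quotes the paper's description of the learned mechanism: the first attention layer copies parent tokens within a window to each position, the FFN with normalization selects the informationally relevant parents, and the second attention layer matches the resulting feature against the feature at the output position to generate the prediction. A minimal hand-coded sketch of such a generalized induction head is given below; it is not the paper's trained model, and the window offsets and exact-match similarity are simplifying assumptions.

```python
# Minimal sketch of a "generalized induction head" for next-token prediction
# on a Markov chain.  The relevant parent offsets and the exact-match
# similarity are illustrative assumptions, not the paper's learned feature.
from collections import Counter

def induction_head_predict(tokens, relevant_offsets=(1, 2)):
    """(1) Copier: gather the parent tokens at `relevant_offsets` behind each
    position.  (2) Selector: keep only those parents as the feature.
    (3) Classifier: match the feature at the output position against earlier
    positions and vote with the tokens that followed the matching contexts."""
    L = len(tokens)
    max_off = max(relevant_offsets)

    def feature(t):
        # Feature of position t: the tuple of its relevant parent tokens.
        return tuple(tokens[t - off] for off in relevant_offsets)

    query = feature(L)              # parents of the position to be predicted
    votes = Counter()
    for t in range(max_off, L):
        if feature(t) == query:     # similarity score: exact match of parents
            votes[tokens[t]] += 1   # token that followed the same context
    return votes.most_common(1)[0][0] if votes else None

# A second-order pattern "... a b -> c ..." is recovered from earlier occurrences.
print(induction_head_predict(list("abcabcab")))  # -> 'c'
```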
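
The Open Datasets and Dataset Splits rows describe purely synthetic data: 10,000 n-gram Markov chains of length L = 100 drawn from a prior, split into 9,000 training and 1,000 validation chains. A sketch of such a pipeline follows; the vocabulary size, chain order, and Dirichlet prior are illustrative assumptions, since the excerpt does not specify the prior distribution P.

```python
# Sketch of the synthetic Markov-chain data pipeline: 10,000 chains of length
# L = 100, split 9,000 / 1,000.  Vocabulary size, chain order, and the
# Dirichlet prior are placeholders for the unspecified prior distribution P.
import numpy as np

VOCAB, ORDER, L, N_CHAINS = 3, 2, 100, 10_000
rng = np.random.default_rng(0)

def sample_chain(rng, vocab=VOCAB, order=ORDER, length=L):
    # Draw a transition kernel p(x_t | previous `order` tokens) from a
    # Dirichlet prior, then roll out a chain of the requested length.
    kernel = rng.dirichlet(np.ones(vocab), size=(vocab,) * order)
    seq = list(rng.integers(vocab, size=order))        # arbitrary initial context
    for _ in range(length - order):
        context = tuple(seq[-order:])
        seq.append(rng.choice(vocab, p=kernel[context]))
    return np.array(seq)

chains = np.stack([sample_chain(rng) for _ in range(N_CHAINS)])
train, val = chains[:9_000], chains[9_000:]
print(train.shape, val.shape)  # (9000, 100) (1000, 100)
```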
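
The Experiment Setup row specifies stage-wise gradient descent with a constant unit learning rate: Stage I updates {c_S} for 2,000 epochs, Stage II updates {w^(h)} for 50,000 epochs, and Stage III updates a for 5,000 epochs. The sketch below shows one way to express such a schedule in PyTorch; the tiny stand-in model and the attribute names c_S, w, and a are placeholders, not the paper's two-layer transformer.

```python
# Sketch of the three-stage schedule: freeze everything, then run plain
# gradient descent (constant lr = 1, cross-entropy loss) on one parameter
# group per stage.  The stand-in model is illustrative only.
import torch

class TinyModel(torch.nn.Module):
    def __init__(self, vocab=3, dim=8):
        super().__init__()
        self.c_S = torch.nn.Parameter(0.01 * torch.ones(dim))   # FFN polynomial coefficients (stand-in)
        self.w = torch.nn.Parameter(0.01 * torch.ones(dim))     # RPE weights (stand-in)
        self.a = torch.nn.Parameter(torch.tensor(0.01))         # second-attention-layer scale (stand-in)
        self.embed = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, x):                                        # x: (batch, length) token ids
        h = self.embed(x).mean(dim=1) * self.c_S * self.w * self.a
        return self.head(h)                                      # next-token logits

def train_stage(model, stage_params, data, targets, epochs, lr=1.0):
    # Update only this stage's parameters; all others stay frozen.
    for p in model.parameters():
        p.requires_grad_(False)
    for p in stage_params:
        p.requires_grad_(True)
    opt = torch.optim.SGD(stage_params, lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(data), targets).backward()
        opt.step()

model = TinyModel()
data = torch.randint(0, 3, (32, 100))     # 32 toy sequences of length 100
targets = torch.randint(0, 3, (32,))      # toy next-token labels
train_stage(model, [model.c_S], data, targets, epochs=2_000)   # Stage I: {c_S} only
train_stage(model, [model.w],   data, targets, epochs=50_000)  # Stage II: {w^(h)} only
train_stage(model, [model.a],   data, targets, epochs=5_000)   # Stage III: a only
```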