Iteration Head: A Mechanistic Study of Chain-of-Thought
Authors: Vivien Cabannes, Charles Arnal, Wassim Bouaziz, Alice Yang, Francois Charton, Julia Kempe
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We hypothesize that iteration heads naturally appear in transformers trained on (hard enough) iterative tasks, and verify this hypothesis in small-scale experiments. Ablation studies demonstrate the impact of the training set and choice of hyperparameters in their emergence. |
| Researcher Affiliation | Collaboration | Vivien Cabannes, FAIR, Meta AI; Charles Arnal, Datashape, INRIA; Wassim Bouaziz, FAIR, Meta AI; Alice Yang, FAIR, Meta AI; Francois Charton, FAIR, Meta AI; Julia Kempe, Courant Institute and Center for Data Science, NYU & FAIR, Meta AI |
| Pseudocode | Yes | Algorithm 1 Iterative Schemes |
| Open Source Code | Yes | Our source code is available at https://github.com/facebookresearch/pal. |
| Open Datasets | No | Data was generated for the binary-copy, parity, and polynomial iteration problem with P(X, Y) = XY + 1 in F_11. For each length L from L_min = 1 to L_max = 32, we generated n = 1024 input sequences of length L (corresponding to a total sequence length of 2L + 3) uniformly at random for both training and testing sets, creating datasets of N = 16,384 = 16 × 1024 sequences in total. |
| Dataset Splits | No | For each length L from L_min = 1 to L_max = 32, we generated n = 1024 input sequences of length L (corresponding to a total sequence length of 2L + 3) uniformly at random for both training and testing sets, creating datasets of N = 16,384 = 16 × 1024 sequences in total. |
| Hardware Specification | Yes | Our experiments consumed 12k V100-hours. |
| Software Dependencies | No | The paper mentions 'PyTorch' [46] but does not provide a specific version number. It also mentions 'Adam' [29], which is an optimizer rather than a software dependency with a version. |
| Experiment Setup | Yes | Unless otherwise stated, our experimental setup is as follows. Data was generated for the binary-copy, parity, and polynomial iteration problem with P(X, Y) = XY + 1 in F_11. For each length L from L_min = 1 to L_max = 32, we generated n = 1024 input sequences of length L (corresponding to a total sequence length of 2L + 3) uniformly at random for both training and testing sets, creating datasets of N = 16,384 = 16 × 1024 sequences in total. We utilized auto-regressive transformers [10] with two layers and one attention head per layer. The embedding dimension was set to d = 128, with learned absolute positional encoding added to the learned token embedding. The weights were optimized over 1000 epochs with Adam [29], a batch size of 256, and a fixed learning rate set to γ = 3 × 10⁻⁴, with default PyTorch parameters otherwise [46]. |
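The tasks quoted above (binary copy, parity, and the polynomial iteration P(X, Y) = XY + 1 over F_11) are all instances of the paper's iterative scheme (Algorithm 1): a state is initialised and then updated one input token at a time, and the chain of intermediate states serves as the chain-of-thought target. The sketch below is a minimal, hypothetical illustration of such data generation in Python; the function names, initial states, and the absence of special tokens are our assumptions, not taken from the facebookresearch/pal repository.

```python
import random

def iterate(tokens, init, step):
    """Generic iterative scheme: s_t = step(s_{t-1}, x_t).

    Returns the chain of intermediate states, which plays the role of
    the chain-of-thought supervision in the quoted setup.
    """
    states, s = [], init
    for x in tokens:
        s = step(s, x)
        states.append(s)
    return states

# Parity: the state is the running XOR of the bits seen so far.
parity_step = lambda s, bit: s ^ bit

# Polynomial iteration with P(X, Y) = X * Y + 1 in F_11.
poly_step = lambda s, x: (s * x + 1) % 11

def make_split(task="parity", n_per_length=1024, l_min=1, l_max=32, seed=0):
    """Draw n uniformly random input sequences for each length L."""
    rng = random.Random(seed)
    data = []
    for length in range(l_min, l_max + 1):
        for _ in range(n_per_length):
            if task == "parity":
                seq = [rng.randint(0, 1) for _ in range(length)]
                cot = iterate(seq, 0, parity_step)   # initial state 0 is an assumption
            else:  # polynomial iteration over F_11
                seq = [rng.randint(0, 10) for _ in range(length)]
                cot = iterate(seq, 1, poly_step)     # initial state 1 is an assumption
            data.append((seq, cot))
    return data
```

In the quoted setup the transformer is then trained auto-regressively on the concatenation of the input and its chain of intermediate states, which presumably accounts for the total sequence length of 2L + 3 (L input tokens, L chain-of-thought tokens, plus a few special tokens).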
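The quoted hyperparameters (two layers, one attention head per layer, embedding dimension d = 128, learned absolute positional encoding, Adam with learning rate 3 × 10⁻⁴, batch size 256, 1000 epochs) can be gathered into a configuration along the following lines. This is an illustrative PyTorch sketch rather than the authors' implementation; the class names, vocabulary size, and feed-forward width are assumptions.

```python
from dataclasses import dataclass

import torch
from torch import nn

@dataclass
class Config:
    vocab_size: int = 16        # assumption: task alphabet plus special tokens
    max_len: int = 2 * 32 + 3   # 2L + 3 with L_max = 32, as in the quoted setup
    n_layers: int = 2
    n_heads: int = 1
    emb_dim: int = 128
    ffn_dim: int = 4 * 128      # assumption: feed-forward width is not quoted
    lr: float = 3e-4
    batch_size: int = 256
    epochs: int = 1000

class TinyTransformer(nn.Module):
    """Auto-regressive transformer with learned absolute positional encoding."""

    def __init__(self, cfg: Config):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.emb_dim)
        self.pos_emb = nn.Embedding(cfg.max_len, cfg.emb_dim)
        block = nn.TransformerEncoderLayer(
            d_model=cfg.emb_dim, nhead=cfg.n_heads,
            dim_feedforward=cfg.ffn_dim, batch_first=True,
        )
        self.blocks = nn.TransformerEncoder(block, num_layers=cfg.n_layers)
        self.head = nn.Linear(cfg.emb_dim, cfg.vocab_size)

    def forward(self, tokens):
        # Learned token embedding plus learned absolute positional encoding.
        pos = torch.arange(tokens.size(1), device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask so the model is trained auto-regressively.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))

cfg = Config()
model = TinyTransformer(cfg)
optimizer = torch.optim.Adam(model.parameters(), lr=cfg.lr)  # default PyTorch parameters otherwise
```

The training loop itself (1000 epochs, batch size 256, next-token cross-entropy) is omitted; only the pieces that pin down the quoted hyperparameters are shown.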