Iteration Head: A Mechanistic Study of Chain-of-Thought

Authors: Vivien Cabannes, Charles Arnal, Wassim Bouaziz, Xingyu Yang, Francois Charton, Julia Kempe

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We hypothesize that iteration heads naturally appear in transformers trained on (hard enough) iterative tasks, and verify this hypothesis in small-scale experiments. Ablation studies demonstrate the impact of the training set and choice of hyperparameters in their emergence.
Researcher Affiliation | Collaboration | Vivien Cabannes (FAIR, Meta AI); Charles Arnal (Datashape, INRIA); Wassim Bouaziz (FAIR, Meta AI); Alice Yang (FAIR, Meta AI); Francois Charton (FAIR, Meta AI); Julia Kempe (Courant Institute and Center for Data Science, NYU & FAIR, Meta AI)
Pseudocode | Yes | Algorithm 1: Iterative Schemes
Open Source Code | Yes | Our source code is available at https://github.com/facebookresearch/pal.
Open Datasets | No | Data was generated for the binary-copy, parity, and polynomial iteration problem with P(X, Y) = XY + 1 in F_11. For each length L from L_min = 1 to L_max = 32, we generated n = 1024 input sequences of length L (corresponding to a total sequence length of 2L + 3) uniformly at random for both training and testing sets, creating datasets of N = 16,384 = 16 × 1,024 sequences in total. (A minimal data-generation sketch follows after the table.)
Dataset Splits | No | For each length L from L_min = 1 to L_max = 32, we generated n = 1024 input sequences of length L (corresponding to a total sequence length of 2L + 3) uniformly at random for both training and testing sets, creating datasets of N = 16,384 = 16 × 1,024 sequences in total.
Hardware Specification | Yes | Our experiments consumed 12k V100-hours.
Software Dependencies | No | The paper mentions 'PyTorch' [46] but does not provide a specific version number. It also mentions 'Adam' [29], which is an optimizer, not a software dependency with a version.
Experiment Setup | Yes | Unless otherwise stated, our experimental setup is as follows. Data was generated for the binary-copy, parity, and polynomial iteration problem with P(X, Y) = XY + 1 in F_11. For each length L from L_min = 1 to L_max = 32, we generated n = 1024 input sequences of length L (corresponding to a total sequence length of 2L + 3) uniformly at random for both training and testing sets, creating datasets of N = 16,384 = 16 × 1,024 sequences in total. We utilized auto-regressive transformers [10] with two layers and one attention head per layer. The embedding dimension was set to d = 128, with learned absolute positional encoding added to the learned token embedding. The weights were optimized over 1000 epochs with Adam [29], a batch size of 256, and a fixed learning rate set to γ = 3 · 10^-4, with default PyTorch parameters otherwise [46]. (A hedged model and training sketch follows below.)
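
The Open Datasets and Experiment Setup rows describe the iterative tasks (binary copy, parity, and polynomial iteration with P(X, Y) = XY + 1 over F_11) and how chain-of-thought sequences were sampled. Below is a minimal data-generation sketch consistent with those quotes; it is not the authors' pipeline (their code is at https://github.com/facebookresearch/pal), and the special-token convention, function names, and vocabularies are assumptions chosen only to match the stated total length of 2L + 3.

```python
import random

# Sketch of an "iterative scheme" in the spirit of Algorithm 1:
# a state s_i is updated from s_{i-1} and the next input token x_i,
# and the chain of thought is the list of intermediate states.
P_MODULUS = 11  # arithmetic over F_11 for the polynomial iteration task

def copy_step(state: int, token: int) -> int:
    return token                         # binary copy: s_i = x_i

def parity_step(state: int, token: int) -> int:
    return state ^ token                 # parity: s_i = s_{i-1} XOR x_i

def poly_step(state: int, token: int) -> int:
    return (state * token + 1) % P_MODULUS  # s_i = P(s_{i-1}, x_i) = s_{i-1} x_i + 1 mod 11

# Assumed special tokens; with them, a length-L input yields a 2L + 3 sequence.
BOS, EOI, EOS = "<bos>", "<eoi>", "<eos>"

def make_cot_sequence(tokens, step, init=0):
    """Input tokens, then the chain-of-thought states s_1, ..., s_L (last state = answer)."""
    states, s = [], init
    for x in tokens:
        s = step(s, x)
        states.append(s)
    return [BOS] + tokens + [EOI] + states + [EOS]

def sample_dataset(step, vocab, n_per_length=1024, l_min=1, l_max=32, seed=0):
    """n_per_length i.i.d. uniform inputs for each length L in [l_min, l_max]."""
    rng = random.Random(seed)
    data = []
    for L in range(l_min, l_max + 1):
        for _ in range(n_per_length):
            tokens = [rng.choice(vocab) for _ in range(L)]
            data.append(make_cot_sequence(tokens, step))
    return data

# Example: parity over binary tokens, polynomial iteration over F_11.
parity_data = sample_dataset(parity_step, vocab=[0, 1])
poly_data = sample_dataset(poly_step, vocab=list(range(P_MODULUS)))
```

A real training pipeline would additionally map these symbolic tokens to integer ids; that tokenization step is omitted here.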
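The Experiment Setup row also specifies the model and optimizer: a two-layer auto-regressive transformer with one attention head per layer, embedding dimension d = 128, learned token and absolute positional embeddings, trained with Adam at γ = 3 · 10^-4 and batch size 256 for 1000 epochs. The following PyTorch sketch illustrates such a configuration under stated assumptions; the vocabulary size, feed-forward width, maximum sequence length, and the use of nn.TransformerEncoder with a causal mask are guesses, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Two-layer, one-head-per-layer causal transformer with d = 128 (illustrative)."""
    def __init__(self, vocab_size=16, d_model=128, n_heads=1, n_layers=2, max_len=67):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)   # learned absolute positional encoding
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                          # tokens: (batch, seq)
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(tokens.device)
        x = self.blocks(x, mask=mask)                   # causal (auto-regressive) attention
        return self.lm_head(x)                          # next-token logits

model = TinyTransformer()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # gamma = 3e-4, default Adam otherwise
loss_fn = nn.CrossEntropyLoss()

# One next-token-prediction step on dummy data with batch size 256
# (the paper trains for 1000 epochs; max_len = 67 matches 2 * 32 + 3).
tokens = torch.randint(0, 16, (256, 67))
optimizer.zero_grad()
logits = model(tokens[:, :-1])
loss = loss_fn(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
```

Using nn.TransformerEncoder with a causal mask is just a compact stand-in for a GPT-style decoder-only stack; the mechanistic claims in the paper concern the attention patterns of such a two-layer model, not this particular module choice.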