Iteration Head: A Mechanistic Study of Chain-of-Thought

Authors: Vivien Cabannes, Charles Arnal, Wassim Bouaziz, Xingyu Yang, Francois Charton, Julia Kempe

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We hypothesize that iteration heads naturally appear in transformers trained on (hard enough) iterative tasks, and verify this hypothesis in small-scale experiments. Ablation studies demonstrate the impact of the training set and choice of hyperparameters in their emergence.
Researcher Affiliation | Collaboration | Vivien Cabannes (FAIR, Meta AI); Charles Arnal (Datashape, INRIA); Wassim Bouaziz (FAIR, Meta AI); Alice Yang (FAIR, Meta AI); Francois Charton (FAIR, Meta AI); Julia Kempe (Courant Institute and Center for Data Science, NYU & FAIR, Meta AI)
Pseudocode | Yes | Algorithm 1: Iterative Schemes
Open Source Code | Yes | Our source code is available at https://github.com/facebookresearch/pal.
Open Datasets | No | Data was generated for the binary-copy, parity, and polynomial iteration problem with P(X, Y) = XY + 1 in F_11. For each length L from L_min = 1 to L_max = 32, we generated n = 1024 input sequences of length L (corresponding to a total sequence length of 2L + 3) uniformly at random for both training and testing sets, creating datasets of N = 16,384 = 16 × 1,024 sequences in total. (A minimal data-generation sketch follows after the table.)
Dataset Splits | No | For each length L from L_min = 1 to L_max = 32, we generated n = 1024 input sequences of length L (corresponding to a total sequence length of 2L + 3) uniformly at random for both training and testing sets, creating datasets of N = 16,384 = 16 × 1,024 sequences in total.
Hardware Specification | Yes | Our experiments consumed 12k V100-hours.
Software Dependencies | No | The paper mentions 'PyTorch' [46] but does not provide a specific version number. It also mentions 'Adam' [29], which is an optimizer, not a software dependency with a version.
Experiment Setup | Yes | Unless otherwise stated, our experimental setup is as follows. Data was generated for the binary-copy, parity, and polynomial iteration problem with P(X, Y) = XY + 1 in F_11. For each length L from L_min = 1 to L_max = 32, we generated n = 1024 input sequences of length L (corresponding to a total sequence length of 2L + 3) uniformly at random for both training and testing sets, creating datasets of N = 16,384 = 16 × 1,024 sequences in total. We utilized auto-regressive transformers [10] with two layers and one attention head per layer. The embedding dimension was set to d = 128, with learned absolute positional encoding added to the learned token embedding. The weights were optimized over 1000 epochs with Adam [29], a batch size of 256, and a fixed learning rate set to γ = 3 · 10^-4, with default PyTorch parameters otherwise [46]. (A hedged model and training sketch follows below.)
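
The Open Datasets and Experiment Setup rows describe the iterative tasks (binary copy, parity, and polynomial iteration with P(X, Y) = XY + 1 over F_11) and how chain-of-thought sequences were sampled. Below is a minimal data-generation sketch consistent with those quotes; it is not the authors' pipeline (their code is at https://github.com/facebookresearch/pal), and the special-token convention, function names, and vocabularies are assumptions chosen only to match the stated total length of 2L + 3.

```python
import random

# Sketch of an "iterative scheme" in the spirit of Algorithm 1:
# a state s_i is updated from s_{i-1} and the next input token x_i,
# and the chain of thought is the list of intermediate states.
P_MODULUS = 11  # arithmetic over F_11 for the polynomial iteration task

def copy_step(state: int, token: int) -> int:
    return token                         # binary copy: s_i = x_i

def parity_step(state: int, token: int) -> int:
    return state ^ token                 # parity: s_i = s_{i-1} XOR x_i

def poly_step(state: int, token: int) -> int:
    return (state * token + 1) % P_MODULUS  # s_i = P(s_{i-1}, x_i) = s_{i-1} x_i + 1 mod 11

# Assumed special tokens; with them, a length-L input yields a 2L + 3 sequence.
BOS, EOI, EOS = "<bos>", "<eoi>", "<eos>"

def make_cot_sequence(tokens, step, init=0):
    """Input tokens, then the chain-of-thought states s_1, ..., s_L (last state = answer)."""
    states, s = [], init
    for x in tokens:
        s = step(s, x)
        states.append(s)
    return [BOS] + tokens + [EOI] + states + [EOS]

def sample_dataset(step, vocab, n_per_length=1024, l_min=1, l_max=32, seed=0):
    """n_per_length i.i.d. uniform inputs for each length L in [l_min, l_max]."""
    rng = random.Random(seed)
    data = []
    for L in range(l_min, l_max + 1):
        for _ in range(n_per_length):
            tokens = [rng.choice(vocab) for _ in range(L)]
            data.append(make_cot_sequence(tokens, step))
    return data

# Example: parity over binary tokens, polynomial iteration over F_11.
parity_data = sample_dataset(parity_step, vocab=[0, 1])
poly_data = sample_dataset(poly_step, vocab=list(range(P_MODULUS)))
```

A real training pipeline would additionally map these symbolic tokens to integer ids; that tokenization step is omitted here.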
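The Experiment Setup row also specifies the model and optimizer: a two-layer auto-regressive transformer with one attention head per layer, embedding dimension d = 128, learned token and absolute positional embeddings, trained with Adam at γ = 3 · 10^-4 and batch size 256 for 1000 epochs. The following PyTorch sketch illustrates such a configuration under stated assumptions; the vocabulary size, feed-forward width, maximum sequence length, and the use of nn.TransformerEncoder with a causal mask are guesses, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Two-layer, one-head-per-layer causal transformer with d = 128 (illustrative)."""
    def __init__(self, vocab_size=16, d_model=128, n_heads=1, n_layers=2, max_len=67):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)   # learned absolute positional encoding
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                          # tokens: (batch, seq)
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(tokens.device)
        x = self.blocks(x, mask=mask)                   # causal (auto-regressive) attention
        return self.lm_head(x)                          # next-token logits

model = TinyTransformer()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # gamma = 3e-4, default Adam otherwise
loss_fn = nn.CrossEntropyLoss()

# One next-token-prediction step on dummy data with batch size 256
# (the paper trains for 1000 epochs; max_len = 67 matches 2 * 32 + 3).
tokens = torch.randint(0, 16, (256, 67))
optimizer.zero_grad()
logits = model(tokens[:, :-1])
loss = loss_fn(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
```

Using nn.TransformerEncoder with a causal mask is just a compact stand-in for a GPT-style decoder-only stack; the mechanistic claims in the paper concern the attention patterns of such a two-layer model, not this particular module choice.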