Transformers, parallel computation, and logarithmic depth

Authors: Clayton Sanford, Daniel Hsu, Matus Telgarsky

ICML 2024

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | Our second set of results concern the k-hop induction heads task, a synthetic sequential task that draws inspiration from the induction heads primitive of Elhage et al. (2021). The theoretical results of Section 4 prove that depth L = Θ(log k) is necessary and sufficient for efficient transformer representation. An accompanying empirical investigation reveals that transformers trained on the task obey the same threshold and recover a similar model to the theoretical construction. (A data-generation sketch for this task follows the table.)

Researcher Affiliation | Academia | 1 Department of Computer Science, Columbia University, New York, NY, USA; 2 Courant Institute, New York University, New York, NY, USA.

Pseudocode | Yes | Figure 1. Formal execution of an MPC protocol for computing f : Z_{2^p}^{n_in} → Z_{2^p}^{n_out}.

Open Source Code | Yes | Further experimental details can be found in Appendix G.1, and the experimental code is available at https://github.com/chsanford/hop-induction-heads.

Open Datasets | No | The paper describes a synthetic task whose data is generated according to specified distributions; no existing public dataset is used or provided for direct download.

Dataset Splits | No | The paper describes generating n_train training samples and evaluating on 'n = 100 samples', but it does not specify distinct training, validation, and test splits with counts or percentages beyond these general descriptions.

Hardware Specification | Yes | All experiments were run on a 2021 Macbook Pro with an M1 chip.

Software Dependencies | No | The paper mentions using 'causally-masked GPT-2 transformers (Radford et al., 2019) from Hugging Face' and training with 'Adam (Kingma & Ba, 2014)', but it gives no version numbers for these software components.

Experiment Setup | Yes | Table 1. Multi-hop task hyper-parameters: context length N = 100, alphabet size |Σ| = 4, max hops k_max = 16. Table 2. Model and training hyper-parameters: embedding dimension m ∈ {128, 256}, depth L ∈ {2, 3, 4, 5, 6}, number of heads H ∈ {4, 8}, vocabulary size 30, activation function GeLU, layer norm ε = 10^-5, training samples n_train ∈ {10^3, 3·10^3}, learning rate 10^-4, training steps 10^5, batch size 32. (A configuration sketch based on these values follows the table.)
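
For concreteness, here is a minimal Python sketch of how data for the k-hop induction heads task might be generated. It assumes the k-hop operation iterates the standard induction-heads lookup (jump to the position just after the most recent earlier occurrence of the current token and read the token there) and that sequences are drawn uniformly over the alphabet; the paper's exact sampling distribution and special-symbol conventions may differ, and all names below are illustrative.

import random

ALPHABET = ["a", "b", "c", "d"]   # alphabet size |Sigma| = 4, per Table 1
N = 100                           # context length, per Table 1
BLANK = "_"                       # placeholder when a hop is undefined (assumed convention)

def hop(x, i):
    """One induction-heads step: index just after the most recent earlier
    occurrence of x[i], or None if x[i] has not appeared before position i."""
    for j in range(i - 1, -1, -1):
        if x[j] == x[i]:
            return j + 1
    return None

def k_hop_label(x, i, k):
    """Apply the hop operation k times starting from position i."""
    for _ in range(k):
        i = hop(x, i)
        if i is None:
            return BLANK
    return x[i]

def sample_example(k, n=N, rng=random):
    """Draw a uniform random sequence and label every position with its k-hop target."""
    x = [rng.choice(ALPHABET) for _ in range(n)]
    y = [k_hop_label(x, i, k) for i in range(n)]
    return x, y

x, y = sample_example(k=3)
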
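Similarly, the model and training hyper-parameters reported in Table 2 can be assembled into a Hugging Face GPT-2 configuration. The paper does not provide this code; the sketch below picks one value from each reported range, and details such as n_positions and the tokenization of the task are assumptions.

import torch
from transformers import GPT2Config, GPT2LMHeadModel

# One setting drawn from the ranges in Table 2; m, L, and H are swept in the paper.
config = GPT2Config(
    vocab_size=30,                     # vocabulary size 30
    n_positions=128,                   # must cover context length N = 100 (exact value assumed)
    n_embd=128,                        # embedding dimension m in {128, 256}
    n_layer=4,                         # depth L in {2, 3, 4, 5, 6}
    n_head=4,                          # number of heads H in {4, 8}
    activation_function="gelu_new",    # GeLU activation
    layer_norm_epsilon=1e-5,           # layer norm epsilon = 10^-5
)
model = GPT2LMHeadModel(config)        # causally-masked GPT-2 language model

# Adam with the reported learning rate; Table 2 lists batch size 32 and 10^5 training steps.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)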