Transformers, parallel computation, and logarithmic depth

Authors: Clayton Sanford, Daniel Hsu, Matus Telgarsky

ICML 2024

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | Our second set of results concern the k-hop induction heads task, a synthetic sequential task that draws inspiration from the induction heads primitive of Elhage et al. (2021). The theoretical results of Section 4 prove that depth L = Θ(log k) is necessary and sufficient for efficient transformer representation. An accompanying empirical investigation reveals that transformers trained on the task obey the same threshold and recover a similar model to the theoretical construction. (A data-generation sketch for this task follows the table.)

Researcher Affiliation | Academia | 1 Department of Computer Science, Columbia University, New York, NY, USA; 2 Courant Institute, New York University, New York, NY, USA.

Pseudocode | Yes | Figure 1. Formal execution of an MPC protocol for computing f : Z_{2^p}^{n_in} → Z_{2^p}^{n_out}.

Open Source Code | Yes | Further experimental details can be found in Appendix G.1, and the experimental code is available at https://github.com/chsanford/hop-induction-heads.

Open Datasets | No | The paper describes a synthetic task whose data is generated according to specified distributions; no existing public dataset is used or provided for direct download.

Dataset Splits | No | The paper describes generating n_train training samples and evaluating on 'n = 100 samples', but it does not specify distinct training, validation, and test splits with counts or percentages beyond these general descriptions.

Hardware Specification | Yes | All experiments were run on a 2021 Macbook Pro with an M1 chip.

Software Dependencies | No | The paper mentions using 'causally-masked GPT-2 transformers (Radford et al., 2019) from Hugging Face' and training with 'Adam (Kingma & Ba, 2014)', but it gives no version numbers for these software components.

Experiment Setup | Yes | Table 1. Multi-hop task hyper-parameters: context length N = 100, alphabet size |Σ| = 4, max hops k_max = 16. Table 2. Model and training hyper-parameters: embedding dimension m ∈ {128, 256}, depth L ∈ {2, 3, 4, 5, 6}, number of heads H ∈ {4, 8}, vocabulary size 30, activation function GeLU, layer norm ε = 10^-5, training samples n_train ∈ {10^3, 3·10^3}, learning rate 10^-4, training steps 10^5, batch size 32. (A configuration sketch based on these values follows the table.)
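
For concreteness, here is a minimal Python sketch of how data for the k-hop induction heads task might be generated. It assumes the k-hop operation iterates the standard induction-heads lookup (jump to the position just after the most recent earlier occurrence of the current token and read the token there) and that sequences are drawn uniformly over the alphabet; the paper's exact sampling distribution and special-symbol conventions may differ, and all names below are illustrative.

import random

ALPHABET = ["a", "b", "c", "d"]   # alphabet size |Sigma| = 4, per Table 1
N = 100                           # context length, per Table 1
BLANK = "_"                       # placeholder when a hop is undefined (assumed convention)

def hop(x, i):
    """One induction-heads step: index just after the most recent earlier
    occurrence of x[i], or None if x[i] has not appeared before position i."""
    for j in range(i - 1, -1, -1):
        if x[j] == x[i]:
            return j + 1
    return None

def k_hop_label(x, i, k):
    """Apply the hop operation k times starting from position i."""
    for _ in range(k):
        i = hop(x, i)
        if i is None:
            return BLANK
    return x[i]

def sample_example(k, n=N, rng=random):
    """Draw a uniform random sequence and label every position with its k-hop target."""
    x = [rng.choice(ALPHABET) for _ in range(n)]
    y = [k_hop_label(x, i, k) for i in range(n)]
    return x, y

x, y = sample_example(k=3)
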
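Similarly, the model and training hyper-parameters reported in Table 2 can be assembled into a Hugging Face GPT-2 configuration. The paper does not provide this code; the sketch below picks one value from each reported range, and details such as n_positions and the tokenization of the task are assumptions.

import torch
from transformers import GPT2Config, GPT2LMHeadModel

# One setting drawn from the ranges in Table 2; m, L, and H are swept in the paper.
config = GPT2Config(
    vocab_size=30,                     # vocabulary size 30
    n_positions=128,                   # must cover context length N = 100 (exact value assumed)
    n_embd=128,                        # embedding dimension m in {128, 256}
    n_layer=4,                         # depth L in {2, 3, 4, 5, 6}
    n_head=4,                          # number of heads H in {4, 8}
    activation_function="gelu_new",    # GeLU activation
    layer_norm_epsilon=1e-5,           # layer norm epsilon = 10^-5
)
model = GPT2LMHeadModel(config)        # causally-masked GPT-2 language model

# Adam with the reported learning rate; Table 2 lists batch size 32 and 10^5 training steps.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)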