Transformers, parallel computation, and logarithmic depth
Authors: Clayton Sanford, Daniel Hsu, Matus Telgarsky
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our second set of results concern the k-hop induction heads task, a synthetic sequential task that draws inspiration from the induction heads primitive of Elhage et al. (2021). The theoretical results of Section 4 prove that depth L = Θ(log k) is necessary and sufficient for efficient transformer representation. An accompanying empirical investigation reveals that transformers trained on the task obey the same threshold and recover a similar model to the theoretical construction. |
| Researcher Affiliation | Academia | (1) Department of Computer Science, Columbia University, New York, NY, USA; (2) Courant Institute, New York University, New York, NY, USA. |
| Pseudocode | Yes | Figure 1. Formal execution of an MPC protocol for computing f : Z_{2^p}^{n_in} → Z_{2^p}^{n_out}. |
| Open Source Code | Yes | Further experimental details can be found in Appendix G.1, and the experimental code is available at https://github.com/chsanford/hop-induction-heads. |
| Open Datasets | No | The paper describes a synthetic task for which data is generated according to specified distributions, but no existing public dataset is used or provided for direct download. |
| Dataset Splits | No | The paper describes generating training samples (n_train) and evaluating on 'n = 100 samples', but it does not specify distinct training, validation, and test splits with counts or percentages beyond these general descriptions. |
| Hardware Specification | Yes | All experiments were run on a 2021 Macbook Pro with an M1 chip. |
| Software Dependencies | No | The paper mentions using 'causally-masked GPT-2 transformers (Radford et al., 2019) from Hugging Face' and training with 'Adam (Kingma & Ba, 2014)', but no specific version numbers for these software components are provided; an illustrative setup sketch appears below the table. |
| Experiment Setup | Yes | Table 1. Multi-hop task hyper-parameters (Context length N = 100, Alphabet size |Σ| = 4, Max hops k_max = 16). Table 2. Model and training hyper-parameters (Embedding dimension m ∈ {128, 256}, Depth L ∈ {2, 3, 4, 5, 6}, Number of heads H ∈ {4, 8}, Vocabulary size 30, Activation function GeLU, Layer norm ϵ = 10^-5, Training samples n_train ∈ {10^3, 3·10^3}, Learning rate 10^-4, Training steps 10^5, Batch size 32). |
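
The k-hop induction heads task is only described, not shown, in the excerpts above. Below is a minimal data-generation sketch assuming the standard reading of the primitive: one hop returns the position immediately after the most recent earlier occurrence of the current token, and the k-hop target is the token reached after k such hops. The function names, the uniform sampling, and the handling of undefined hops are illustrative assumptions, not the authors' implementation (their code is in the repository linked under Open Source Code).

```python
# Illustrative sketch of k-hop induction-heads data generation
# (hypothetical reading of the task; not the authors' code).
import random

def sample_sequence(length=100, alphabet="abcd"):
    """Draw a uniformly random sequence over a small alphabet
    (context length N = 100, |alphabet| = 4, matching Table 1)."""
    return [random.choice(alphabet) for _ in range(length)]

def hop(seq, i):
    """One induction-heads hop: locate the most recent earlier occurrence
    of seq[i] and return the index of the token that follows it.
    Returns None when seq[i] has not appeared before position i."""
    for j in range(i - 1, -1, -1):
        if seq[j] == seq[i]:
            return j + 1
    return None

def k_hop_target(seq, i, k):
    """Compose the hop map k times; the prediction target is the token
    at the final index, or None if any intermediate hop is undefined."""
    idx = i
    for _ in range(k):
        idx = hop(seq, idx)
        if idx is None:
            return None
    return seq[idx]

# Example: the label predicted at the last position for k = 3 hops.
seq = sample_sequence()
print("".join(seq), "->", k_hop_target(seq, len(seq) - 1, k=3))
```

Under this reading, with k_max = 16 the theoretical depth threshold Θ(log k) sits around log₂ 16 = 4, which lies inside the depth grid L ∈ {2, ..., 6} reported in Table 2.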
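
Since no software versions are pinned, the following is a minimal, hypothetical reconstruction of the model and optimizer described in Table 2 using the Hugging Face transformers GPT-2 classes and PyTorch's Adam. The class choices, the position-embedding size, and the single value picked from each hyper-parameter grid are assumptions for illustration; the authors' actual setup is in the linked repository.

```python
# Hypothetical reconstruction of the Table 2 setup; not the authors' code.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=30,                   # "Vocabulary size 30"
    n_positions=128,                 # assumed; must cover context length N = 100
    n_embd=128,                      # embedding dimension m ∈ {128, 256}
    n_layer=4,                       # depth L ∈ {2, 3, 4, 5, 6}
    n_head=4,                        # number of heads H ∈ {4, 8}
    activation_function="gelu_new",  # GeLU activation (GPT-2 default variant)
    layer_norm_epsilon=1e-5,         # layer-norm ϵ = 10^-5
)
model = GPT2LMHeadModel(config)      # causally-masked GPT-2 language model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate 10^-4

# A training loop (batch size 32, 10^5 steps) would iterate over batches of
# the synthetic k-hop data, e.g.:
#   out = model(input_ids=batch, labels=batch)
#   out.loss.backward(); optimizer.step(); optimizer.zero_grad()
```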