Transformers on Markov data: Constant depth suffices
Authors: Nived Rajaraman, Marco Bondaschi, Ashok Vardhan Makkuva, Kannan Ramchandran, Michael Gastpar
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We observe a surprising phenomenon empirically which contradicts previous findings: when trained for sufficiently long, a transformer with a fixed depth and 1 head per layer is able to achieve low test loss on sequences drawn from kth-order Markov sources, even as k grows. |
| Researcher Affiliation | Academia | Nived Rajaraman (UC Berkeley); Marco Bondaschi (EPFL); Kannan Ramchandran (UC Berkeley); Michael Gastpar (EPFL); Ashok Vardhan Makkuva (EPFL) |
| Pseudocode | Yes | The paper presents two structured algorithm blocks: 'Architecture 1: Attention-only transformer' and 'Architecture 2: Modified transformer architecture'. |
| Open Source Code | Yes | Code is available at: https://github.com/Bond1995/Constant-depth-Transformers. |
| Open Datasets | Yes | In our experiments, we will consider k-th order Markov kernels sampled from a Dirichlet prior with parameter 1. Namely, the transition P(· \| X1 = i1, ..., Xk = ik) is sampled independently and uniformly on the S-dimensional simplex Δ_{S−1}, for each tuple (i1, ..., ik). Table 3 lists the dataset as a k-th order binary Markov source (see the sketch after the table). |
| Dataset Splits | No | The paper mentions 'Test loss' in Figure 3 and discusses training and testing, but does not explicitly describe validation data splits or percentages. |
| Hardware Specification | Yes | The experiments were run on one 8 × A100 GPU node. |
| Software Dependencies | No | Table 3 lists the optimizer as AdamW and notes that the architecture is 'Based on the GPT-2 architecture as implemented in [30]', but the paper does not specify version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | Table 3 ('Settings and parameters for the transformer model used in the experiments') provides details such as batch size, accumulation steps, optimizer (AdamW with beta values), learning rate, scheduler, number of iterations, weight decay, dropout, sequence length, embedding dimension, number of transformer layers, attention heads, and repetitions. |
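
The data-generation process described in the Open Datasets row (a k-th order Markov kernel whose conditional distributions are drawn i.i.d. from a Dirichlet prior with parameter 1, i.e. uniformly on the simplex) can be sketched as follows. This is a minimal illustration, not the authors' released code: the function names, the uniform initialization of the first k symbols, and the context-indexing scheme are assumptions made for this sketch.

```python
import numpy as np

def sample_markov_kernel(k, S, rng):
    """Sample a k-th order Markov transition kernel over S symbols.

    Each conditional distribution P(. | x_1, ..., x_k) is drawn i.i.d.
    from a Dirichlet(1, ..., 1) prior, i.e. uniformly on the simplex.
    Returns an array of shape (S**k, S): one row per length-k context.
    """
    return rng.dirichlet(np.ones(S), size=S ** k)

def sample_sequence(kernel, k, S, length, rng):
    """Generate a length-`length` sequence from the sampled kernel.

    The first k symbols are drawn uniformly at random (an assumption of
    this sketch; the paper's exact initialization may differ).
    """
    seq = list(rng.integers(0, S, size=k))
    for _ in range(length - k):
        # Encode the most recent k symbols as a base-S index into the kernel.
        ctx = 0
        for s in seq[-k:]:
            ctx = ctx * S + int(s)
        seq.append(rng.choice(S, p=kernel[ctx]))
    return np.array(seq)

# Example: a binary (S = 2) source of order k = 3, matching the binary
# Markov sources used in the paper's experiments.
rng = np.random.default_rng(0)
kernel = sample_markov_kernel(k=3, S=2, rng=rng)
x = sample_sequence(kernel, k=3, S=2, length=512, rng=rng)
```

Sampling from Dirichlet(1, ..., 1) is equivalent to sampling uniformly on the probability simplex, which is why a single `dirichlet` call with all-ones concentration parameters reproduces the prior described in the paper.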