Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Transformers on Markov data: Constant depth suffices
Authors: Nived Rajaraman, Marco Bondaschi, Ashok Vardhan Makkuva, Kannan Ramchandran, Michael Gastpar
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We observe a surprising phenomenon empirically which contradicts previous findings: when trained for sufficiently long, a transformer with a fixed depth and 1 head per layer is able to achieve low test loss on sequences drawn from kth-order Markov sources, even as k grows. |
| Researcher Affiliation | Academia | Nived Rajaraman (UC Berkeley), Marco Bondaschi (EPFL), Kannan Ramchandran (UC Berkeley), Michael Gastpar (EPFL), Ashok Vardhan Makkuva (EPFL) |
| Pseudocode | Yes | "Architecture 1: Attention-only transformer" and "Architecture 2: Modified transformer architecture" are structured algorithm blocks presented in the paper. |
| Open Source Code | Yes | Code is available at: https://github.com/Bond1995/Constant-depth-Transformers. |
| Open Datasets | Yes | In our experiments, we will consider kth-order Markov kernels sampled from a Dirichlet prior with parameter 1. Namely, the transition $P(\cdot \mid X_1 = i_1, \dots, X_k = i_k)$ is sampled independently and uniformly on the $S$-dimensional simplex $\Delta^{S-1}$, for each tuple $(i_1, \dots, i_k)$. Table 3: Dataset: k-th order binary Markov source |
| Dataset Splits | No | The paper mentions 'Test loss' in Figure 3 and discusses training and testing, but does not explicitly describe validation data splits or percentages. |
| Hardware Specification | Yes | The experiments were run on one 8 × A100 GPU node. |
| Software Dependencies | No | Table 3 lists 'Optimizer AdamW' and mentions the architecture is 'Based on the GPT-2 architecture as implemented in [30]', but it does not specify version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | Table 3: Settings and parameters for the transformer model used in the experiments. This table provides details such as Batch size, Accumulation steps, Optimizer (AdamW with beta values), Learning rate, Scheduler, # Iterations, Weight decay, Dropout, Sequence length, Embedding dimension, Transformer layers, Attention heads, and Repetitions. |
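
The data-generating process quoted under Open Datasets (kth-order Markov kernels drawn from a Dirichlet prior with parameter 1) can be illustrated with a minimal sketch. This is not the authors' code from the linked repository: the function names `sample_markov_kernel` and `sample_sequence` are hypothetical, and drawing the first k symbols uniformly at random is an assumption, since the quoted text does not state how sequences are initialized.

```python
import numpy as np


def sample_markov_kernel(k, S, rng):
    """Draw one next-symbol distribution per length-k context from Dirichlet(1, ..., 1).

    Returns an array of shape (S,) * k + (S,), where kernel[i_1, ..., i_k]
    is the distribution of the next symbol given the context (i_1, ..., i_k).
    """
    # Dirichlet with all parameters equal to 1 is uniform on the probability simplex.
    flat = rng.dirichlet(np.ones(S), size=S ** k)
    return flat.reshape((S,) * k + (S,))


def sample_sequence(kernel, length, rng):
    """Generate a sequence of the given length from a kth-order kernel (k >= 1)."""
    k = kernel.ndim - 1
    S = kernel.shape[-1]
    # Assumption (not stated in the source): initial k symbols are uniform.
    seq = [int(x) for x in rng.integers(0, S, size=k)]
    while len(seq) < length:
        context = tuple(seq[-k:])
        seq.append(int(rng.choice(S, p=kernel[context])))
    return np.array(seq[:length])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    kernel = sample_markov_kernel(k=3, S=2, rng=rng)  # binary source, order 3
    print(sample_sequence(kernel, length=20, rng=rng))
```

Each run with a fresh kernel gives a different Markov source, so train and test sequences in a setup like this would typically be sampled on the fly rather than read from a fixed dataset file.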