Transformers on Markov data: Constant depth suffices
Authors: Nived Rajaraman, Marco Bondaschi, Ashok Vardhan Makkuva, Kannan Ramchandran, Michael Gastpar
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We observe a surprising phenomenon empirically which contradicts previous findings: when trained for sufficiently long, a transformer with a fixed depth and 1 head per layer is able to achieve low test loss on sequences drawn from kth-order Markov sources, even as k grows. |
| Researcher Affiliation | Academia | Nived Rajaraman (UC Berkeley); Marco Bondaschi (EPFL); Kannan Ramchandran (UC Berkeley); Michael Gastpar (EPFL); Ashok Vardhan Makkuva (EPFL) |
| Pseudocode | Yes | The paper presents two structured algorithm blocks: 'Architecture 1: Attention-only transformer' and 'Architecture 2: Modified transformer architecture'. |
| Open Source Code | Yes | Code is available at: https://github.com/Bond1995/Constant-depth-Transformers. |
| Open Datasets | Yes | In our experiments, we will consider k-th order Markov kernels sampled from a Dirichlet prior with parameter 1. Namely, the transition P(· \| X1 = i1, ..., Xk = ik) is sampled independently and uniformly on the S-dimensional simplex Δ_{S−1}, for each tuple (i1, ..., ik). Table 3 lists the dataset as a k-th order binary Markov source (see the sketch after the table). |
| Dataset Splits | No | The paper mentions 'Test loss' in Figure 3 and discusses training and testing, but does not explicitly describe validation data splits or percentages. |
| Hardware Specification | Yes | The experiments were run on one 8 × A100 GPU node. |
| Software Dependencies | No | Table 3 lists the optimizer as AdamW and notes that the architecture is 'Based on the GPT-2 architecture as implemented in [30]', but the paper does not specify version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | Table 3 ('Settings and parameters for the transformer model used in the experiments') provides details such as batch size, accumulation steps, optimizer (AdamW with beta values), learning rate, scheduler, number of iterations, weight decay, dropout, sequence length, embedding dimension, number of transformer layers, attention heads, and repetitions. |
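
The data-generation process described in the Open Datasets row (a k-th order Markov kernel whose conditional distributions are drawn i.i.d. from a Dirichlet prior with parameter 1, i.e. uniformly on the simplex) can be sketched as follows. This is a minimal illustration, not the authors' released code: the function names, the uniform initialization of the first k symbols, and the context-indexing scheme are assumptions made for this sketch.

```python
import numpy as np

def sample_markov_kernel(k, S, rng):
    """Sample a k-th order Markov transition kernel over S symbols.

    Each conditional distribution P(. | x_1, ..., x_k) is drawn i.i.d.
    from a Dirichlet(1, ..., 1) prior, i.e. uniformly on the simplex.
    Returns an array of shape (S**k, S): one row per length-k context.
    """
    return rng.dirichlet(np.ones(S), size=S ** k)

def sample_sequence(kernel, k, S, length, rng):
    """Generate a length-`length` sequence from the sampled kernel.

    The first k symbols are drawn uniformly at random (an assumption of
    this sketch; the paper's exact initialization may differ).
    """
    seq = list(rng.integers(0, S, size=k))
    for _ in range(length - k):
        # Encode the most recent k symbols as a base-S index into the kernel.
        ctx = 0
        for s in seq[-k:]:
            ctx = ctx * S + int(s)
        seq.append(rng.choice(S, p=kernel[ctx]))
    return np.array(seq)

# Example: a binary (S = 2) source of order k = 3, matching the binary
# Markov sources used in the paper's experiments.
rng = np.random.default_rng(0)
kernel = sample_markov_kernel(k=3, S=2, rng=rng)
x = sample_sequence(kernel, k=3, S=2, length=512, rng=rng)
```

Sampling from Dirichlet(1, ..., 1) is equivalent to sampling uniformly on the probability simplex, which is why a single `dirichlet` call with all-ones concentration parameters reproduces the prior described in the paper.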