Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers

Authors: Charles London, Varun Kanade

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We provide the first formal separation result, proving that adding pause tokens to constant-depth, logarithmic-width Transformers strictly increases their computational expressivity. ... Empirically, we demonstrate that two-layer causally masked Transformers can learn parity when supplied with pause tokens, a function that they appear unable to learn without them.
Researcher Affiliation	Academia	Charles London, Department of Computer Science, University of Oxford, Oxford, OX1 3QG, EMAIL; Varun Kanade, Department of Computer Science, University of Oxford, Oxford, OX1 3QG, EMAIL
Pseudocode	No	The paper describes methods and processes verbally and through figures (e.g., Figure 1: "Two layers of a Transformer with pause tokens can simulate a layer of a Boolean circuit.") but does not contain a formal pseudocode or algorithm block.
Open Source Code	Yes	Yes, code to reproduce the experimental results is provided in a .zip file, and experiment hyperparameters can be found in the appendix (and in the code).
Open Datasets	No	The datasets consist of uniformly sampled bitstrings, where the label is the parity. We generate a dataset per random seed for each length in {20, 50, 100, 150, 200, 250, 300}.
Dataset Splits	Yes	Training, validation, and test datasets consist of 500,000, 5,000, and 50,000 examples, respectively. The datasets consist of uniformly sampled bitstrings, where the label is the parity. We generate a dataset per random seed for each length in {20, 50, 100, 150, 200, 250, 300}.
Hardware Specification	Yes	All experiments were conducted using a single NVIDIA V100 GPU.
Software Dependencies	No	The paper mentions using a 'custom variant of Huggingface's GPT-2 implementation' but does not specify version numbers for any software libraries or dependencies. For example, it does not mention the Python version, PyTorch version, or CUDA version.
Experiment Setup	Yes	Our Transformer model uses 2 layers, 4 attention heads, and a hidden dimension of 32. Positional encodings are learned during training. All models are trained for 50 epochs using the Adam optimiser with a learning rate of 5 × 10⁻⁴, β1 = 0.9, β2 = 0.999, and no weight decay. We disable mixed precision and gradient clipping, as the models are small and training is stable.
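
The Open Datasets and Dataset Splits rows describe synthetic data: uniformly sampled bitstrings labelled with their parity, generated per random seed and per length, with 500,000 / 5,000 / 50,000 training, validation, and test examples. The following is a minimal sketch of how such data could be generated. It is not the authors' code: the function names, the seeding scheme, and the choice to draw all three splits from one seeded generator are assumptions made for illustration.

```python
import random

# Split sizes and lengths as quoted in the table above.
SPLIT_SIZES = {"train": 500_000, "validation": 5_000, "test": 50_000}
LENGTHS = (20, 50, 100, 150, 200, 250, 300)


def sample_parity_examples(rng, num_examples, length):
    """Draw uniform bitstrings and label each one with its parity.

    Hypothetical helper: returns a list of (bits, label) pairs, where
    label is 1 if the bitstring contains an odd number of ones.
    """
    examples = []
    for _ in range(num_examples):
        bits = [rng.randint(0, 1) for _ in range(length)]
        examples.append((bits, sum(bits) % 2))
    return examples


def make_splits(length, seed, sizes=SPLIT_SIZES):
    """Build train/validation/test splits for one (length, seed) pair.

    Assumption: a single seeded generator feeds all three splits, so the
    same seed always reproduces the same data.
    """
    rng = random.Random(seed)
    return {
        name: sample_parity_examples(rng, n, length)
        for name, n in sizes.items()
    }
```

Calling `make_splits(length, seed)` once for each length in `LENGTHS` and each seed would yield one dataset per seed and length, as the table describes. Passing a smaller `sizes` dictionary is a quick way to check the sketch before generating the full 500,000 training examples.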