Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers

Authors: Charles London, Varun Kanade

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide the first formal separation result, proving that adding pause tokens to constant-depth, logarithmic-width Transformers strictly increases their computational expressivity. ... Empirically, we demonstrate that two-layer causally masked Transformers can learn parity when supplied with pause tokens, a function that they appear unable to learn without them.
Researcher Affiliation | Academia | Charles London, Department of Computer Science, University of Oxford, Oxford, OX1 3QG, EMAIL; Varun Kanade, Department of Computer Science, University of Oxford, Oxford, OX1 3QG, EMAIL
Pseudocode | No | The paper describes methods and processes verbally and through figures (e.g., Figure 1: "Two layers of a Transformer with pause tokens can simulate a layer of a Boolean circuit.") but does not contain a formal pseudocode or algorithm block.
Open Source Code | Yes | Yes, code to reproduce the experimental results is provided in a .zip file, and experiment hyperparameters can be found in the appendix (and in the code).
Open Datasets | No | The datasets consist of uniformly sampled bitstrings, where the label is the parity. We generate a dataset per random seed for each length in {20, 50, 100, 150, 200, 250, 300}.
Dataset Splits | Yes | Training, validation, and test datasets consist of 500,000, 5,000, and 50,000 examples, respectively. The datasets consist of uniformly sampled bitstrings, where the label is the parity. We generate a dataset per random seed for each length in {20, 50, 100, 150, 200, 250, 300}.
Hardware Specification | Yes | All experiments were conducted using a single NVIDIA V100 GPU.
Software Dependencies | No | The paper mentions using a 'custom variant of Huggingface's GPT-2 implementation' but does not specify version numbers for any software libraries or dependencies. For example, it does not mention the Python version, PyTorch version, or CUDA version.
Experiment Setup | Yes | Our Transformer model uses 2 layers, 4 attention heads, and a hidden dimension of 32. Positional encodings are learned during training. All models are trained for 50 epochs using the Adam optimiser with a learning rate of 5 × 10⁻⁴, β1 = 0.9, β2 = 0.999, and no weight decay. We disable mixed precision and gradient clipping, as the models are small and training is stable.
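
The Open Datasets and Dataset Splits rows describe data that is generated rather than downloaded: uniformly sampled bitstrings labeled with their parity, one dataset per random seed and length. The following is a minimal sketch of how such data could be produced, using only the Python standard library. The function names, the seeding scheme, and the choice to draw all three splits from one generator are assumptions made for illustration; they are not taken from the authors' released code.

```python
import random

# Sequence lengths and split sizes as quoted in the table above.
LENGTHS = [20, 50, 100, 150, 200, 250, 300]
SPLIT_SIZES = {"train": 500_000, "validation": 5_000, "test": 50_000}


def parity(bits):
    """Return 1 if the bitstring contains an odd number of ones, else 0."""
    return sum(bits) % 2


def sample_split(rng, length, num_examples):
    """Draw uniform bitstrings of the given length, each paired with its parity."""
    examples = []
    for _ in range(num_examples):
        # getrandbits yields a uniform integer in [0, 2**length).
        word = rng.getrandbits(length)
        bits = [(word >> i) & 1 for i in range(length)]
        examples.append((bits, parity(bits)))
    return examples


def make_datasets(seed, length, split_sizes=SPLIT_SIZES):
    """Hypothetical helper: build all splits for one (seed, length) pair."""
    rng = random.Random(seed)  # assumed seeding scheme, one generator per dataset
    return {name: sample_split(rng, length, n) for name, n in split_sizes.items()}


if __name__ == "__main__":
    # Small sizes keep the demonstration fast; the report quotes the full sizes.
    demo_sizes = {"train": 8, "validation": 2, "test": 4}
    data = make_datasets(seed=0, length=LENGTHS[0], split_sizes=demo_sizes)
    for name, examples in data.items():
        print(name, len(examples))
```

Because the data depends only on the seed and the length, anyone can regenerate it from the sampling procedure, even though no dataset files are published. This sketch does not deduplicate bitstrings across splits; whether the original code does so is not stated in the report.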