Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers
Authors: Charles London, Varun Kanade
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide the first formal separation result, proving that adding pause tokens to constant-depth, logarithmic-width Transformers strictly increases their computational expressivity. ... Empirically, we demonstrate that two-layer causally masked Transformers can learn parity when supplied with pause tokens, a function that they appear unable to learn without them. |
| Researcher Affiliation | Academia | Charles London Department of Computer Science University of Oxford Oxford, OX1 3QG EMAIL Varun Kanade Department of Computer Science University of Oxford Oxford, OX1 3QG EMAIL |
| Pseudocode | No | The paper describes methods and processes verbally and through figures (e.g., Figure 1: "Two layers of a Transformer with pause tokens can simulate a layer of a Boolean circuit.") but does not contain a formal pseudocode or algorithm block. |
| Open Source Code | Yes | Yes, code to reproduce the experimental results is provided in a .zip file, and experiment hyperparameters can be found in the appendix (and in the code). |
| Open Datasets | No | The datasets consist of uniformly sampled bitstrings, where the label is the parity. We generate a dataset per random seed for each length in {20, 50, 100, 150, 200, 250, 300}. |
| Dataset Splits | Yes | Training, validation, and test datasets consist of 500,000, 5,000, and 50,000 examples, respectively. The datasets consist of uniformly sampled bitstrings, where the label is the parity. We generate a dataset per random seed for each length in {20, 50, 100, 150, 200, 250, 300}. |
| Hardware Specification | Yes | All experiments were conducted using a single NVIDIA V100 GPU. |
| Software Dependencies | No | The paper mentions using a 'custom variant of Huggingface's GPT-2 implementation' but does not specify version numbers for any software libraries or dependencies. For example, it does not mention the Python version, PyTorch version, or CUDA version. |
| Experiment Setup | Yes | Our Transformer model uses 2 layers, 4 attention heads, and a hidden dimension of 32. Positional encodings are learned during training. All models are trained for 50 epochs using the Adam optimiser with a learning rate of 5 × 10⁻⁴, β1 = 0.9, β2 = 0.999, and no weight decay. We disable mixed precision and gradient clipping, as the models are small and training is stable. |
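
The Open Datasets and Dataset Splits rows describe synthetic data: uniformly sampled bitstrings labeled with their parity, generated once per random seed and sequence length. The sketch below illustrates one way such a dataset could be produced. It is not the authors' code: the function names `sample_parity_examples` and `make_splits`, the way the seed and length are combined, and the choice to sample the three splits independently are all assumptions made for illustration. Only the split sizes and the list of lengths come from the table.

```python
import random

# Split sizes and sequence lengths as reported in the table above.
SPLIT_SIZES = {"train": 500_000, "validation": 5_000, "test": 50_000}
LENGTHS = [20, 50, 100, 150, 200, 250, 300]


def sample_parity_examples(num_examples, length, rng):
    """Return (bits, label) pairs; each bit is drawn uniformly from {0, 1}."""
    examples = []
    for _ in range(num_examples):
        bits = [rng.getrandbits(1) for _ in range(length)]
        label = sum(bits) % 2  # parity: 1 if the number of ones is odd
        examples.append((bits, label))
    return examples


def make_splits(length, seed, split_sizes=SPLIT_SIZES):
    """Build one dataset for a given (seed, length) pair.

    Assumption: the splits are sampled independently from the same
    generator; the report does not say how the paper separates them.
    """
    # Hypothetical way of deriving a distinct generator per (seed, length).
    rng = random.Random(seed * 1000 + length)
    return {
        name: sample_parity_examples(size, length, rng)
        for name, size in split_sizes.items()
    }
```

For a quick check, calling `make_splits(20, seed=0, split_sizes={"train": 8, "validation": 2, "test": 4})` returns a small dataset with the same structure. At the full reported sizes, this pure-Python loop would be slow for the longer sequences, and a vectorised generator would be a more practical choice.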