Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Characterizing the Expressivity of Fixed-Precision Transformer Language Models

Authors: Jiaoda Li, Ryan Cotterell

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Finally, we present empirical results that align closely with our theory: transformers trained on languages within their characterized expressive capacity generalize reliably across sequence lengths, while they consistently fail to generalize on languages beyond it.
Researcher Affiliation	Academia	Jiaoda Li Ryan Cotterell EMAIL
Pseudocode	No	The paper describes theoretical constructions and proofs for logical equivalences and transformer behavior, but does not include structured pseudocode or algorithm blocks.
Open Source Code	Yes	Code available at GitHub repository. Our code is adapted from https://github.com/google-deepmind/neural_networks_chomsky_hierarchy, licensed under the Apache License, Version 2.0.
Open Datasets	No	We construct a suite of languages spanning a fine-grained hierarchy of formal language classes. Our results (Tab. 1) exhibit strong alignment between theory and practice: for all languages that transformers are predicted to recognize, the models generalize perfectly over lengths (100% accuracy); for languages beyond their theoretical capacity, they consistently make generalization errors, regardless of learning rates or random seeds.
Dataset Splits	Yes	Models are trained on strings up to length 40, and tested on strings of length 41–500. Each experiment is run with 5 different random seeds and 3 learning rates. We consider a transformer to have successfully recognized a language if it achieves 100% accuracy in at least one of the runs. Details of the experimental setup and model configurations are provided in E.1.1.
Hardware Specification	Yes	All experiments were conducted on a single GPU with 24 GB of memory, each taking approximately one hour to complete.
Software Dependencies	No	Our code is adapted from https://github.com/google-deepmind/neural_networks_chomsky_hierarchy, licensed under the Apache License, Version 2.0.
Experiment Setup	Yes	We use a transformer with soft attention, strict future masking, L = 5 layers, model size D = 64, and NoPE. Training strings are of length up to 40, and tested on strings of length 41–500. The model is trained for 1,000,000 steps with a batch size of 128. For evaluation, we generate 512 samples per test length. For comparison, we also train a long short-term memory (LSTM) [20] with a hidden size of 256. Each experiment is run with 5 different random seeds and 3 learning rates (1e-4, 3e-4, 5e-4).
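
The setup and success criterion quoted in the table can be summarized in a short Python sketch. This is an illustration only, not the authors' code: the names RunConfig, build_grid, and recognizes_language are hypothetical, and the seed values 0 to 4 are an assumption, since the quoted text states only that 5 seeds are used. The numeric values are those quoted above.

```python
from dataclasses import dataclass
from itertools import product

# Values quoted in the table above; field names are hypothetical.
@dataclass(frozen=True)
class RunConfig:
    seed: int
    learning_rate: float
    num_layers: int = 5            # L = 5
    model_dim: int = 64            # D = 64
    batch_size: int = 128
    train_steps: int = 1_000_000
    max_train_length: int = 40     # training strings up to length 40
    min_test_length: int = 41      # test strings of length 41-500
    max_test_length: int = 500
    samples_per_test_length: int = 512

SEEDS = range(5)                      # assumption: actual seed values are not stated
LEARNING_RATES = (1e-4, 3e-4, 5e-4)   # as quoted


def build_grid():
    """One configuration per (seed, learning rate) pair."""
    return [RunConfig(seed=s, learning_rate=lr)
            for s, lr in product(SEEDS, LEARNING_RATES)]


def num_test_samples(cfg):
    """Total evaluation strings for one run: every test length gets the same count."""
    num_lengths = cfg.max_test_length - cfg.min_test_length + 1
    return num_lengths * cfg.samples_per_test_length


def recognizes_language(run_results):
    """Success criterion as quoted: 100% accuracy in at least one run.

    run_results is a list of (num_correct, num_total) pairs, one per run.
    Comparing counts avoids floating-point rounding in the accuracy.
    """
    return any(total > 0 and correct == total for correct, total in run_results)
```

Under these assumptions the grid contains 15 runs per language, and a language counts as recognized if any one of those runs classifies every test string correctly.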