Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Metric Automata Theory: A Unifying Theory of RNNs

Authors: Adam Dankowiakowski, Alessandro Ronca

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	5 Empirical Validation of Our Results Mamba performance on star-free tasks. The experiments presented by [Sarrof et al., 2024] demonstrate that Mamba can effectively learn star-free languages with length-generalisation abilities. On the benchmark from [Bhattamishra et al., 2023], Mamba performed perfectly on all 11 star-free tasks, also on out-of-distribution input lengths. This is consistent with its expressivity described by Theorem 8. We performed additional experiments on the dataset from [Liu et al., 2023]. It introduces the task of realizing the FLIP-FLOP by predictively modelling a sequence of instructions. We found that in the case of training 1-layer Mamba, despite achieving accuracy 1 on all validation datasets, iterating the ignore instruction indeed leads to incorrect outputs, as predicted by our results for η-finite systems, namely Theorem 7. See Figures 5,10 and Appendix F for details.
Researcher Affiliation	Collaboration	Adam Dankowiakowski University of Oxford EMAIL Alessandro Ronca IRIS-AI EMAIL
Pseudocode	No	The paper describes methods and concepts using mathematical notation and textual explanations, but it does not contain any explicitly labeled pseudocode blocks or algorithms.
Open Source Code	Yes	The code used to perform the experiments is based on the repository shared in Grazzi et al. [2025], with some environment modifications to make it work on the 2025-04-09 Google Colab release. The forked repository is available at https://github.com/adankow/unlocking_state_tracking, with a Google Colab notebook file containing the set-up, simple training loop, and hidden state visualisation code.
Open Datasets	Yes	We performed additional experiments on the dataset from [Liu et al., 2023]. It introduces the task of realizing the FLIP-FLOP by predictively modelling a sequence of instructions. The dataset is available at https://huggingface.co/datasets/synthseq/flipflop/.
Dataset Splits	No	We trained 1-layer Mamba on sequence lengths 32, 64, and 512, observing similar state-collapse phenomena, as predicted by our results. Additionally [Sarrof et al., 2024] note that in their experiments Mamba needed more training steps to converge than reported by Liu et al. [2023] for an LSTM. This is another evidence towards the influence of robustness on stability of training.
Hardware Specification	No	The paper mentions running experiments on a "Google Colab release" which implies cloud computing resources, but it does not specify any particular GPU models, CPU types, or other detailed hardware specifications used for the experiments.
Software Dependencies	No	The code used to perform the experiments is based on the repository shared in Grazzi et al. [2025], with some environment modifications to make it work on the 2025-04-09 Google Colab release.
Experiment Setup	No	We trained 1-layer Mamba on sequence lengths 32, 64, and 512, observing similar state-collapse phenomena, as predicted by our results. Additionally [Sarrof et al., 2024] note that in their experiments Mamba needed more training steps to converge than reported by Liu et al. [2023] for an LSTM. This is another evidence towards the influence of robustness on stability of training.
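
The FLIP-FLOP task quoted in the rows above can be illustrated with a small reference automaton. This is a minimal sketch for intuition only: the instruction names "w", "r", "i" and the (instruction, bit) tuple encoding are assumptions made here, not the format of the dataset or the authors' code. It shows the behaviour an exact model must have, namely that the stored bit survives any number of ignore instructions.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical encoding (an assumption for illustration):
# each step is (instruction, bit), with instruction in {"w", "r", "i"}.
Step = Tuple[str, int]


def flip_flop_states(steps: Iterable[Step]) -> List[Optional[int]]:
    """Return the stored bit after every step (None before the first write)."""
    memory: Optional[int] = None
    states: List[Optional[int]] = []
    for instruction, bit in steps:
        if instruction == "w":
            # A write overwrites the stored bit.
            memory = bit
        elif instruction not in ("r", "i"):
            raise ValueError(f"unknown instruction: {instruction!r}")
        # "r" (read) and "i" (ignore) leave the stored bit unchanged.
        states.append(memory)
    return states


# One write, then a long run of ignore instructions, then a read.
long_run = [("w", 1)] + [("i", 0)] * 1000 + [("r", 1)]
final_state = flip_flop_states(long_run)[-1]
```

In this sketch `final_state` is still the written bit after 1000 ignore steps. The quoted evidence reports that a trained 1-layer Mamba departs from this behaviour when the ignore instruction is iterated.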