Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Bridging Expressivity and Scalability with Adaptive Unitary SSMs

Authors: Arjun Karuvally, Franz Nowak, Andy Keller, Carmen Amo Alonso, Terrence J. Sejnowski, Hava T. Siegelmann

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, we show that AUSSM and its hybrid variant interleaved with Mamba outperform prior SSMs on formal algorithmic tasks such as parity and modular arithmetic, and achieve competent performance on real-world long time-series classification benchmarks. Our results demonstrate that adaptive unitary recurrence provides a powerful and efficient inductive bias for both symbolic and continuous sequence modeling.
Researcher Affiliation	Academia	Arjun Karuvally Salk Institute for Biological Studies EMAIL Franz Nowak ETH Zürich EMAIL T. Anderson Keller The Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University EMAIL Carmen Amo Alonso Computer Science Department Stanford University EMAIL Terrence J. Sejnowski Salk Institute for Biological Studies EMAIL Hava T. Siegelmann University of Massachusetts Amherst EMAIL
Pseudocode	No	The paper contains Python code snippets in Appendix G.1 but they are presented as implementation examples rather than formally labeled pseudocode or algorithm blocks.
Open Source Code	Yes	The code is available at https://github.com/arjunkaruvally/AUSSM
Open Datasets	Yes	Empirically, we validate the theoretical claims through a suite of algorithmic tasks, demonstrating substantial performance gains over Mamba, and showing that AUSSM retains competitive efficiency through an optimized CUDA implementation. Further, we evaluate the long-range modeling capabilities by testing on a suite of time series benchmarks. ... To evaluate the practical benefits of our architecture, we test the hybrid AUSSM+Mamba model on a suite of UEA long-time-series classification benchmarks [32] and the challenging Weather regression benchmark. ... For these tasks, we release the dataloaders along with the code. The code will be made public following the publication of the manuscript.
Dataset Splits	Yes	For testing, we modified the procedure as the five arbitrary random seeds used to evaluate test performance in prior works may introduce unwanted biases due to the low number of random samples. Also, prior works used JAX for implementations, while we used PyTorch, and the random seed does not create the same train-validation-test sets due to differences in the pseudorandom number generators. We thus decided to evaluate on train-validation-test splits created with 20 different seeds. ... The validation set is sampled independently from 40-256 sequence lengths and had 1,000 samples. The test set had 10,000 samples from sequences of up to 256 sequence lengths.
Hardware Specification	Yes	Experiments were run on a single NVIDIA 2080 Ti GPU with 11 GB VRAM. ... All the models were run in a supercomputing cluster, where we used 40 2080Ti GPUs for all except the dataset Eigenworms dataset that required higher memory. This is the lowest GPU available in the cluster, with at least a CUDA compute of 7.5 required to run the Mamba and AUSSM CUDA kernels. For a larger memory Eigenworms workload, we used the L4 GPU, which has a VRAM of 23GB.
Software Dependencies	No	The paper mentions using PyTorch in Appendix H.2, but does not provide specific version numbers for PyTorch or other key software dependencies.
Experiment Setup	Yes	For each task, we performed a hyperparameter search over the following grid: d ∈ {16, 64, 128}, n ∈ {16, 64, 128}, learning rate ∈ {0.00001, 0.0001, 0.001}, and five different seeds for model selection. The model hyperparameters with the highest mean validation accuracy are chosen for evaluation in the test set. ... The batch size was fixed at 256. For pure AUSSM blocks, we tested networks with a depth of 2, 4, and 6. For hybrid AUSSM blocks, we tested all possible 2-block configurations of Mamba (represented as m) and AUSSM blocks (represented as a) {ma, am, mm, aa}. ... Table 6: Best Hyperparameters.
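
The search procedure quoted in the Experiment Setup row can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the function names (`hyperparameter_grid`, `hybrid_block_configs`, `select_best`) are hypothetical, and only the grid values, the batch size, the number of model-selection seeds, and the "highest mean validation accuracy" rule come from the quoted text.

```python
from itertools import product
from statistics import mean

# Values quoted in the Experiment Setup row.
D_VALUES = [16, 64, 128]
N_VALUES = [16, 64, 128]
LEARNING_RATES = [1e-5, 1e-4, 1e-3]
BATCH_SIZE = 256
NUM_SELECTION_SEEDS = 5  # the seed values themselves are not given in the source


def hyperparameter_grid():
    """Enumerate every (d, n, learning rate) combination of the quoted grid."""
    return list(product(D_VALUES, N_VALUES, LEARNING_RATES))


def hybrid_block_configs(num_blocks=2):
    """All orderings of Mamba ('m') and AUSSM ('a') blocks of a given depth."""
    return ["".join(blocks) for blocks in product("ma", repeat=num_blocks)]


def select_best(val_accuracies):
    """Pick the configuration with the highest mean validation accuracy.

    `val_accuracies` maps a (d, n, lr) tuple to a list of per-seed
    validation accuracies (a hypothetical data layout for this sketch).
    """
    return max(val_accuracies, key=lambda cfg: mean(val_accuracies[cfg]))


if __name__ == "__main__":
    grid = hyperparameter_grid()
    print(len(grid), "configurations,", len(grid) * NUM_SELECTION_SEEDS, "training runs")
    print(hybrid_block_configs())
```

Under these assumptions, the grid contains 3 × 3 × 3 = 27 configurations per task, so five selection seeds imply 135 training runs per task and architecture variant, before the depth and block-order variations are taken into account.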