Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Hankel Singular Value Regularization for Highly Compressible State Space Models

Authors: Paul Schwerdtner, Jules Berman, Benjamin Peherstorfer

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments on Long Range Arena benchmarks demonstrate that the regularized state space layers are up to 10× more compressible than standard state space layers while maintaining high accuracy. Experiments with standard LRA benchmark examples demonstrate that we can compress models by up to 90% while maintaining acceptable accuracy.
Researcher Affiliation Academia Paul Schwerdtner Courant Institute of Mathematical Sciences New York University New York, NY 10012 EMAIL
Pseudocode Yes Algorithm 1 Bisection method for reduced state dimension determination
Open Source Code Yes An implementation is provided at https://github.com/Algopaul/hankelreg. Our jax implementation is available at www.github.com/Algopaul/hankelreg.
Open Datasets Yes The first example consists of the 32 × 32 CIFAR-10 images [37] that are converted to grayscale, flattened into 1,024-length sequences, and normalized to zero mean and unit variance across the entire dataset. [...] The second example is also a sequentialized image classification task and consists of the 28 × 28 grayscale MNIST [38] images, where again each image is flattened into a sequence of 784 scalar values. [...] The third task uses the IMDB sentiment dataset [41], where movie reviews are represented as sequences of one-hot encoded characters with 129 possible values, padded to a maximum length of 4,096. [...] Finally, we consider the PATH and PATH-X datasets, which consist of the flattened pathfinder images [40]
Dataset Splits Yes It includes 50,000 training, and 10,000 test samples and has ten target classes. [...] the dataset includes 25,000 training and 25,000 test examples.
Hardware Specification Yes on a single H100 GPU
Software Dependencies No The paper mentions "flax/nnx implementation" and "jax implementation" but does not provide specific version numbers for these or other software components.
Experiment Setup Yes We select the state, input, and output dimensions of our SSMs according to the setup in [53]. In particular, we use a state dimension n = 384, and input and output dimensions m = p = 512 for sCIFAR10 (grayscale), n = m = p = 128 for sMNIST, n = 192, m = p = 256 for IMDB, n = 256, m = p = 192 for PATH, and n = 256, m = p = 128 for PATH-X. As in [53] for sCIFAR10 (grayscale), IMDB, PATH, and PATH-X, we use 6 SSM layers and for sMNIST we use 4 layers. The remainder of the model architecture, which we describe alongside the training parameters in the Appendix in Section B, is also the same as in [53]. In all examples, we use HSVR with the Hankel nuclear norm regularizer (11), even though other regularizers based on the Hankel singular values could be used, which remains future work. One notable difference compared to [53, 26] is that we only use unidirectional associative scans, whereas [53, 26] scan bidirectionally for sCIFAR (grayscale) and IMDB. ... In Table 2 we show the parameters used to generate our results. The parameters for dropout, weight-decay and regularization magnitude (the scalar by which we multiply our regularizer (11)) are found via grid-search. ... The SSM parameters ρ are initialized using Gaussian distributions with mean 1.5, and standard deviation 0.25, which yields to an eigenvalue distribution similar to that of HiPPO matrices after discretization. ... The matrices B and C in each state space layer are initialized with zero-mean Gaussian distributions, with standard deviation $1/\sqrt{n^2 + m^2}$ and $1/\sqrt{n^2 + p^2}$, respectively.
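
The initialization quoted in the Experiment Setup row can be illustrated with a minimal NumPy sketch. It is not the authors' flax/nnx code: the function name `init_ssm_layer`, the seed handling, the use of real-valued B and C, and the choice of one ρ value per state dimension are all assumptions made for illustration. The excerpt does not say how ρ maps to the eigenvalues of the state matrix, so that step is left out.

```python
import numpy as np


def init_ssm_layer(n, m, p, seed=0):
    """Draw initial parameters for one state space layer (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    # rho: Gaussian with mean 1.5 and standard deviation 0.25, as quoted above.
    # Assumption: one rho value per state dimension.
    rho = rng.normal(loc=1.5, scale=0.25, size=n)
    # B (n x m): zero-mean Gaussian with standard deviation 1/sqrt(n^2 + m^2).
    B = rng.normal(loc=0.0, scale=1.0 / np.sqrt(n**2 + m**2), size=(n, m))
    # C (p x n): zero-mean Gaussian with standard deviation 1/sqrt(n^2 + p^2).
    C = rng.normal(loc=0.0, scale=1.0 / np.sqrt(n**2 + p**2), size=(p, n))
    return rho, B, C


# sCIFAR10 (grayscale) dimensions from the setup above: n = 384, m = p = 512.
rho, B, C = init_ssm_layer(n=384, m=512, p=512)
```

With these dimensions the standard deviation for both B and C is 1/√(384² + 512²) = 1/640, so the initial entries are small compared with the state dimension.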