The Expressive Capacity of State Space Models: A Formal Language Perspective

Authors: Yash Sarrof, Yana Veitsman, Michael Hahn

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a comprehensive theoretical study of the capacity of such SSMs as it compares to that of transformers and traditional RNNs. We find that SSMs and transformers have overlapping but distinct strengths. In star-free state tracking, SSMs implement length-generalizing solutions to problems that transformers struggle to represent exactly. They can also model bounded hierarchical structure with optimal memory even without simulating a stack. On the other hand, we identify a design choice in current SSMs that limits their expressive power. We discuss implications for SSM and LM research, and verify results empirically on a recent SSM, Mamba.
Researcher Affiliation | Academia | Yash Sarrof, Yana Veitsman, Michael Hahn; Saarland Informatics Campus, Saarland University, Germany; {ysarrof, yanav, mhahn}@lst.uni-saarland.de
Pseudocode | No | The paper describes algorithms and constructions verbally and with figures (e.g., Figure 3), but it does not include formal pseudocode blocks or algorithm listings.
Open Source Code | Yes | Code is available at: https://github.com/lacoco-lab/ssm_expressivity
Open Datasets | Yes | We obtained the dataset of Liu et al. [2023a] from their release, https://huggingface.co/datasets/synthseq/flipflop (MIT license). For all the languages, we use either the data prepared by Bhattamishra et al. [2020] or, where not available, their data-generation scripts, allowing full comparability with the results they reported for transformers. We used their official code and data release at https://github.com/satwik77/Transformer-Formal-Languages (last commit 48eea2e; MIT license). (See the data-loading sketch after the table.)
Dataset Splits | Yes | The training and the validation sets contained samples of length 700, while the test set contained samples of length 700 ≤ n ≤ 1400. Training inputs have length in [1, 50]. There are two held-out bins: one with in-distribution lengths ([1, 50]), used for hyperparameter optimization, and one testing length generalization (lengths [51, 100]). (See the split-construction sketch after the table.)
Hardware Specification | No | The paper states, 'All experiments used the Mamba reference implementation', but does not provide specific details on the CPU, GPU, or memory used for the experiments.
Software Dependencies | No | The paper mentions using the 'Mamba reference implementation' and the 'AdamW optimizer' but does not specify version numbers for these or any other software dependencies (e.g., Python version, PyTorch version).
Experiment Setup | Yes | For each language, we conducted an extensive hyperparameter search. We varied the d_model parameter in Mamba across the set {16, 32, 64, 128, 256}. Additionally, we experimented with the number of layers in our model, ranging from 1 to 3, training each configuration for 100 epochs. We used the AdamW optimizer. To identify optimal learning rates, we started with a coarse hyperparameter search using values from the set {0.001, 0.0001, 0.00001}. Finally, we varied the batch size over {16, 32, 64} for datasets with 10K training examples. For languages like aⁿbⁿ with limited training size, we searched for an optimal batch size within the set {5, 10}. (See the grid-search sketch after the table.)
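As a starting point for reproducing the data setup in the Open Datasets row, here is a minimal sketch of loading the flipflop dataset via the Hugging Face `datasets` library. The dataset id is taken from the quoted URL; the split and field names are not specified there, so the sketch simply inspects whatever is returned.

```python
from datasets import load_dataset

# Load the flipflop dataset released by Liu et al. [2023a] (MIT license).
# The dataset id comes from the URL quoted above; split and field names
# are not documented there, so we just inspect the returned structure.
ds = load_dataset("synthseq/flipflop")

print(ds)                     # available splits and their sizes
first_split = next(iter(ds))  # e.g. "train", if present
print(ds[first_split][0])     # one raw example
```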
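The Dataset Splits row amounts to length-binned sampling. The sketch below illustrates that binning under stated assumptions: the string generator is a placeholder (the actual experiments use the language-specific generators from the Bhattamishra et al. [2020] scripts), and the held-out bin sizes are illustrative rather than taken from the paper; only the training-set size (10K) and the length ranges come from the quoted text.

```python
import random

rng = random.Random(0)

def sample_string(n):
    # Placeholder generator: a random binary string of length n. The actual
    # experiments use language-specific generators from the Bhattamishra
    # et al. [2020] scripts.
    return "".join(rng.choice("01") for _ in range(n))

def make_bin(num_samples, min_len, max_len):
    # Draw num_samples strings whose lengths fall in [min_len, max_len].
    return [sample_string(rng.randint(min_len, max_len)) for _ in range(num_samples)]

train_set = make_bin(10_000, 1, 50)            # training lengths in [1, 50]
heldout_in_dist = make_bin(1_000, 1, 50)       # in-distribution bin, used for tuning
heldout_length_gen = make_bin(1_000, 51, 100)  # length-generalization bin
```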
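The Experiment Setup row describes a grid search; the sketch below enumerates exactly that grid. Only the grid values, the epoch count, and the use of AdamW come from the quoted setup; the placeholder model and the training step are illustrative stand-ins for the authors' Mamba training code, not their implementation.

```python
from itertools import product

import torch

# Hyperparameter grid from the quoted setup.
D_MODELS = [16, 32, 64, 128, 256]
NUM_LAYERS = [1, 2, 3]
LEARNING_RATES = [1e-3, 1e-4, 1e-5]  # coarse initial search
BATCH_SIZES = [16, 32, 64]           # {5, 10} for small datasets such as a^n b^n
EPOCHS = 100

for d_model, n_layers, lr, batch_size in product(D_MODELS, NUM_LAYERS,
                                                 LEARNING_RATES, BATCH_SIZES):
    # Placeholder model: in the actual experiments this would be a Mamba stack
    # with the given d_model and number of layers.
    model = torch.nn.Sequential(*[torch.nn.Linear(d_model, d_model) for _ in range(n_layers)])
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    print(f"d_model={d_model} layers={n_layers} lr={lr} batch={batch_size} epochs={EPOCHS}")
    # ... train for EPOCHS epochs with this optimizer and batch size, then keep
    # the configuration that scores best on the in-distribution held-out bin ...
```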