The Expressive Capacity of State Space Models: A Formal Language Perspective
Authors: Yash Sarrof, Yana Veitsman, Michael Hahn
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a comprehensive theoretical study of the capacity of such SSMs as it compares to that of transformers and traditional RNNs. We find that SSMs and transformers have overlapping but distinct strengths. In star-free state tracking, SSMs implement length-generalizing solutions to problems that transformers struggle to represent exactly. They can also model bounded hierarchical structure with optimal memory even without simulating a stack. On the other hand, we identify a design choice in current SSMs that limits their expressive power. We discuss implications for SSM and LM research, and verify results empirically on a recent SSM, Mamba. |
| Researcher Affiliation | Academia | Yash Sarrof, Yana Veitsman, Michael Hahn, Saarland Informatics Campus, Saarland University, Germany, {ysarrof, yanav, mhahn}@lst.uni-saarland.de |
| Pseudocode | No | The paper describes algorithms and constructions verbally and with figures (e.g., Figure 3), but it does not include formal pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Code is available at: https://github.com/lacoco-lab/ssm_expressivity |
| Open Datasets | Yes | We obtained the dataset of Liu et al. [2023a] from their release, https://huggingface.co/datasets/synthseq/flipflop (MIT license). For all the languages, we use either the data prepared by Bhattamishra et al. [2020] or, where not available, their data-generation scripts, allowing full comparability with results they reported for transformers. We used their official code and data release at https://github.com/satwik77/Transformer-Formal-Languages (last commit 48eea2e; MIT license). (A minimal data-loading sketch follows the table.) |
| Dataset Splits | Yes | The training and the validation set contained samples of length 700, while the test set contained samples of length between 700 and 1400. Training inputs have length in [1,50]; the model is evaluated on two held-out bins: one with in-distribution lengths ([1,50]) and one testing length generalization (lengths [51,100]). The first bin was used for hyperparameter optimization. (A length-binning sketch follows the table.) |
| Hardware Specification | No | The paper states, 'All experiments used the Mamba reference implementation' but does not provide specific details on the CPU, GPU, or memory used for the experiments. |
| Software Dependencies | No | The paper mentions using the 'Mamba reference implementation' and the 'AdamW optimizer' but does not specify version numbers for these or any other software dependencies (e.g., Python version, PyTorch version). |
| Experiment Setup | Yes | For each language, we conducted extensive hyperparameter search. We varied the d_model parameter in Mamba across the set {16, 32, 64, 128, 256}. Additionally, we experimented with the number of layers in our model, ranging from 1 to 3, training each configuration for 100 epochs. We used the AdamW optimizer. To identify optimal learning rates, we started with a coarse hyperparameter search using values from the set {0.001, 0.0001, 0.00001}. Finally, we varied the batch size across {16, 32, 64} for datasets with 10K training examples. For languages like a^n b^n with limited training size, we searched for an optimal batch size within the set {5, 10}. (A grid-search sketch follows the table.) |
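
The "Open Datasets" row points to the Hugging Face release of the flip-flop dataset. As a minimal sketch, the snippet below loads that release with the `datasets` library; the split and column layout is not specified in the report, so the printed dataset object should be checked against the dataset card.

```python
# Minimal sketch: loading the flip-flop dataset of Liu et al. [2023a] from its
# Hugging Face release (MIT license). Split/column names are not given in the
# report, so we only load and inspect the dataset here.
from datasets import load_dataset

flipflop = load_dataset("synthseq/flipflop")
print(flipflop)  # shows the available splits and their features
```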
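The "Dataset Splits" row describes two held-out evaluation bins, [1,50] and [51,100]. The helper below is an illustrative sketch of that binning, assuming raw strings as inputs; the example sequences are hypothetical.

```python
# Illustrative sketch: assigning evaluation strings to the two held-out length
# bins from the "Dataset Splits" row. Bin boundaries follow the report; the
# example strings are hypothetical.
from typing import Dict, List

def bin_by_length(examples: List[str]) -> Dict[str, List[str]]:
    """Sort each string into the in-distribution or length-generalization bin."""
    bins: Dict[str, List[str]] = {
        "in_distribution_1_50": [],
        "length_generalization_51_100": [],
    }
    for s in examples:
        if 1 <= len(s) <= 50:
            bins["in_distribution_1_50"].append(s)
        elif 51 <= len(s) <= 100:
            bins["length_generalization_51_100"].append(s)
    return bins

bins = bin_by_length(["ab" * 10, "ab" * 40])  # lengths 20 and 80
print({name: len(members) for name, members in bins.items()})
```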
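The "Experiment Setup" row lists the hyperparameter grid. The loop below simply enumerates that grid; `train_and_evaluate` is a hypothetical stand-in for the actual training code in the released repository, and model selection uses the in-distribution held-out bin as stated above.

```python
# Sketch of the hyperparameter grid from the "Experiment Setup" row.
# Only the enumeration is shown; training itself lives in the released repo.
from itertools import product

D_MODEL = [16, 32, 64, 128, 256]
N_LAYERS = [1, 2, 3]
LEARNING_RATES = [1e-3, 1e-4, 1e-5]   # coarse learning-rate search
BATCH_SIZES = [16, 32, 64]            # for 10K-example datasets; {5, 10} for a^n b^n

for d_model, n_layers, lr, batch_size in product(D_MODEL, N_LAYERS, LEARNING_RATES, BATCH_SIZES):
    config = dict(d_model=d_model, n_layers=n_layers, lr=lr,
                  batch_size=batch_size, epochs=100, optimizer="AdamW")
    # train_and_evaluate(config)  # hypothetical stand-in for the released training code
    print(config)
```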