The Illusion of State in State-Space Models

Authors: William Merrill, Jackson Petty, Ashish Sabharwal

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our analysis reveals that the expressive power of S4, Mamba, and related SSMs is limited very similarly to that of transformers (within TC0), meaning these SSMs cannot solve simple state-tracking problems like permutation composition (see the sketch after this table) and are consequently provably unable to accurately track chess moves in certain notation, evaluate code, or track entities in a long narrative. To supplement our formal analysis, we report experiments showing that S4 and Mamba indeed struggle with state tracking.
Researcher Affiliation | Collaboration | New York University; Allen Institute for AI. Correspondence to: William Merrill <willm@nyu.edu>, Jackson Petty <petty@nyu.edu>, Ashish Sabharwal <ashishs@allenai.org>.
Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code: http://jpetty.org/ssm-illusion
Open Datasets | No | The paper mentions generating sequences from mathematical groups (A5, A4 × Z5, or Z60) and including '3600 pairwise sequences of length 2 in the training data' (a hypothetical reconstruction appears after this table), but it does not provide concrete access information (link, DOI, citation) for a publicly available dataset.
Dataset Splits | No | The paper mentions evaluating 'validation accuracy' but does not specify exact percentages or counts for the training, validation, and test splits.
Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, or other libraries).
Experiment Setup | No | The paper describes the task (token tagging), model initialization (e.g., 'affine projection α as a random normal centered around the identity'; a sketch follows this table), and training process (e.g., 'train models on sequences of length n'), but it lacks specific hyperparameter values such as learning rate, batch size, or number of epochs.
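To make the state-tracking task in the Research Type row concrete, here is a minimal Python sketch (not the authors' code) of permutation composition as token tagging: each input token is a permutation, and the label at position t is the composition of the first t+1 tokens. The paper proves that TC0-bounded models cannot solve this for non-solvable groups; the sketch below uses S5 for illustration, while the experiments use order-60 groups such as A5.

```python
# Minimal sketch of the permutation-composition state-tracking task.
# Each label is the running group state after reading the prefix so far.
import itertools
import random

# All 120 permutations of {0,...,4} (the symmetric group S5).
S5 = list(itertools.permutations(range(5)))

def compose(p, q):
    """Apply q first, then p: (p o q)(i) = p[q[i]]."""
    return tuple(p[q[i]] for i in range(len(q)))

def make_example(length, rng=random):
    """Sample a sequence of permutations and its prefix compositions.

    Returns (tokens, labels), where labels[t] is the composition of
    tokens[0..t], i.e., the group state after t+1 input tokens.
    """
    tokens = [rng.choice(S5) for _ in range(length)]
    labels, state = [], tuple(range(5))  # start from the identity
    for p in tokens:
        state = compose(p, state)
        labels.append(state)
    return tokens, labels

tokens, labels = make_example(8)
print(labels[-1])  # final group state after composing all 8 tokens
```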
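The Open Datasets row quotes '3600 pairwise sequences of length 2'. All three groups named there have order 60 (|A5| = |A4 × Z5| = |Z60| = 60), so 60 × 60 = 3600 covers every ordered pair of group elements. The following is a hypothetical reconstruction of such data, not the released code; it uses Z60 for brevity, and the sequence length n and sample count are placeholders.

```python
# Hypothetical reconstruction of the group-sequence training data:
# sequences over a group of order 60, tagged with the running product.
import itertools
import random

GROUP = list(range(60))  # Z60: integers under addition mod 60

def op(a, b):
    """The group operation of Z60."""
    return (a + b) % 60

def label_prefixes(seq):
    """Tag each position with the product of all elements so far."""
    out, state = [], 0  # 0 is the identity of Z60
    for g in seq:
        state = op(state, g)
        out.append(state)
    return out

# All 60 * 60 = 3600 length-2 sequences, included in training per the paper.
pairs = [list(p) for p in itertools.product(GROUP, repeat=2)]

# Longer random training sequences; n and the count are assumed values.
n = 16
long_seqs = [[random.choice(GROUP) for _ in range(n)] for _ in range(1000)]

train = [(s, label_prefixes(s)) for s in pairs + long_seqs]
print(len(pairs))  # 3600
```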
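Finally, the Experiment Setup row quotes initializing an 'affine projection α as a random normal centered around the identity'. Below is a hedged PyTorch sketch of one way to realize that phrase; the function name and the noise scale `std` are assumptions, since the paper (per the row above) does not report these hyperparameters.

```python
# Sketch (assumed details, not the authors' implementation) of initializing
# an affine projection as identity plus small Gaussian noise.
import torch
import torch.nn as nn

def init_affine_near_identity(d, std=0.02):
    """Return an nn.Linear whose weight is identity + N(0, std^2) noise.

    `std` is a hypothetical noise scale; the paper does not report one.
    """
    layer = nn.Linear(d, d)
    with torch.no_grad():
        layer.weight.copy_(torch.eye(d) + std * torch.randn(d, d))
        layer.bias.zero_()
    return layer

alpha = init_affine_near_identity(64)
x = torch.randn(2, 64)
print(alpha(x).shape)  # torch.Size([2, 64]); near-identity map at init
```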