Theoretical Foundations of Deep Selective State-Space Models

Authors: Nicola Muca Cirone, Antonio Orvieto, Benjamin Walker, Cristopher Salvi, Terry Lyons

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we give theoretical grounding to the selectivity mechanism, often linked to in-context learning, using tools from Rough Path Theory. We provide a framework for the theoretical analysis of generalized selective SSMs, fully characterizing their expressive power and identifying the gating mechanism as the crucial architectural choice. Our analysis provides a closed-form description of the expressive powers of modern SSMs, such as Mamba, quantifying theoretically the drastic improvement in performance from the previous generation of models, such as S4. Our theory not only motivates the success of modern selective state-space models, but also provides a solid framework to understand the expressive power of future SSM variants.
Researcher Affiliation | Academia | Nicola Muca Cirone, Department of Mathematics, Imperial College London; Antonio Orvieto, MPI for Intelligent Systems, Tübingen AI Center, ELLIS Institute Tübingen; Benjamin Walker, Mathematical Institute, University of Oxford; Cristopher Salvi, Department of Mathematics, Imperial College London; Terry Lyons, Mathematical Institute, University of Oxford
Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper.
Open Source Code | Yes | Code to reproduce all of our experiments can be found at: https://github.com/Benjamin-Walker/selective-ssms-and-linear-cdes
Open Datasets | Yes | The first task is based on a dataset from Walker et al. [2024], where the aim is to predict terms in the anti-symmetric part of the input path's signature. The second task is the A5 benchmark from Merrill et al. [2024].
Dataset Splits | Yes | For each model, we plotted the mean and range of the validation accuracy over 5 independent runs.
Hardware Specification | No | The paper mentions the 'University of Oxford Advanced Research Computing (ARC) facility' but does not specify any particular hardware details such as GPU models, CPU types, or memory specifications used for the experiments.
Software Dependencies | No | The paper does not list specific version numbers for the software components or libraries used in the experiments.
Experiment Setup | Yes | All models use a hidden dimension of 256, with the state space models using a state dimension of 256. The state space models are trained using gradient descent with a batch size of 32 and Adam with a learning rate of 10^-4. The output from the linear CDE's recurrence is obtained using the Tsit5 adaptive ODE solver, with an absolute and relative tolerance of 10^-2. The linear CDE's linear readout is optimised via ordinary least squares. ... The RNN, transformer, S4, and Mamba use a hidden dimension of 1024, with the state space models using a state dimension of 64 and the transformer using 64 heads. ... Furthermore, all models have trainable matrices in their recurrences and are trained using batch gradient descent with a batch size of 32 and AdamW with a weight decay of 0.01.
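
To make the selectivity mechanism discussed in the Research Type row concrete, below is a minimal numpy sketch of an input-independent (S4-style) linear recurrence next to an input-gated (Mamba-style) one. The sigmoid gate and the particular parametrisation are illustrative assumptions rather than the paper's exact models; the sketch only shows how the transition becomes input-dependent.

```python
import numpy as np

def lti_ssm(x, A, B):
    """Non-selective (S4-style) recurrence: the transition A is fixed and
    independent of the input, so the map from x to the final state is linear."""
    h = np.zeros(A.shape[0])
    for x_t in x:                 # x has shape (seq_len, input_dim)
        h = A @ h + B @ x_t
    return h

def selective_ssm(x, A, B, W_gate):
    """Selective (Mamba-style) recurrence sketch: an input-dependent gate
    modulates the transition, so the map from x to the state is nonlinear."""
    h = np.zeros(A.shape[0])
    for x_t in x:
        gate = 1.0 / (1.0 + np.exp(-(W_gate @ x_t)))   # sigmoid gate in (0, 1)
        h = gate * (A @ h) + B @ x_t                   # the input controls how much state is kept
    return h

rng = np.random.default_rng(0)
d_in, d_state, seq_len = 4, 8, 32
x = rng.normal(size=(seq_len, d_in))
A = 0.9 * np.eye(d_state)
B = rng.normal(size=(d_state, d_in)) / np.sqrt(d_in)
W_gate = rng.normal(size=(d_state, d_in)) / np.sqrt(d_in)

print(lti_ssm(x, A, B)[:3])
print(selective_ssm(x, A, B, W_gate)[:3])
```

The paper identifies exactly this input dependence of the recurrence, the gating, as the architectural choice separating modern selective SSMs from the previous generation of models such as S4.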
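
The first dataset in the Open Datasets row asks models to predict terms in the anti-symmetric part of the input path's signature; at level two, that anti-symmetric part is the path's Lévy area. The numpy sketch below computes it for a piecewise-linear path; it illustrates the kind of target involved and is not claimed to be the exact construction of Walker et al. [2024].

```python
import numpy as np

def antisymmetric_level2(path):
    """Anti-symmetric part of the level-2 signature (the Levy area matrix)
    of a piecewise-linear path given as an array of shape (length, dim)."""
    increments = np.diff(path, axis=0)                   # step-wise increments
    before = np.cumsum(increments, axis=0) - increments  # total increment accumulated before each step
    level2 = before.T @ increments + 0.5 * increments.T @ increments
    return 0.5 * (level2 - level2.T)

path = np.cumsum(np.random.default_rng(0).normal(size=(100, 3)), axis=0)
print(antisymmetric_level2(path))   # 3 x 3 anti-symmetric matrix
```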
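
Finally, the hyperparameters quoted in the Experiment Setup row can be read as a short configuration sketch. The use of optax here is an assumption (the excerpt does not name the training framework), and the AdamW learning rate is a placeholder, since only the weight decay is stated for that setting.

```python
import optax  # assumed optimiser library; the excerpt does not name the framework used

# Single-layer experiments: hidden/state dimension 256, Adam, batch size 32.
single_layer = dict(hidden_dim=256, state_dim=256, batch_size=32)
single_layer_opt = optax.adam(learning_rate=1e-4)

# Stacked-model experiments: hidden dim 1024, state dim 64, 64 attention heads,
# AdamW with weight decay 0.01; the learning rate below is a placeholder.
stacked = dict(hidden_dim=1024, state_dim=64, num_heads=64, batch_size=32)
stacked_opt = optax.adamw(learning_rate=1e-4, weight_decay=0.01)
```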