Diagonal State Spaces are as Effective as Structured State Spaces

Authors: Ankit Gupta, Albert Gu, Jonathan Berant

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate the performance of DSS on Long Range Arena (LRA) which is a suite of sequence-level classification tasks with diverse input lengths (1K-16K) requiring similarity, structural, and visual-spatial reasoning over a wide range of modalities such as text, natural/synthetic images, and mathematical expressions. Despite its simplicity, DSS delivers an average accuracy of 81.88 across the 6 tasks of LRA, comparable to the state-of-the-art performance of S4 (80.21)."
Researcher Affiliation | Collaboration | Ankit Gupta (IBM Research, ankitgupta.iitkanpur@gmail.com); Albert Gu (Stanford University, albertgu@stanford.edu); Jonathan Berant (Tel Aviv University, joberant@cs.tau.ac.il)
Pseudocode | Yes | Algorithm 1: DSSsoftmax Kernel (Sketch)
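The DSSsoftmax kernel named above can be sketched as follows. This is a minimal, naive NumPy version for a single channel, assuming Lambda ∈ C^N, w ∈ C^N, and a scalar step size delta > 0 (the paper's implementation handles additional numerically delicate cases that this sketch omits):

```python
import numpy as np

def dss_softmax_kernel(Lambda, w, delta, L):
    """Sketch of the DSSsoftmax kernel for one channel.

    Lambda : (N,) complex  - diagonal state matrix entries
    w      : (N,) complex  - output projection
    delta  : float > 0     - discretization step size
    L      : int           - kernel / sequence length
    Returns the length-L real convolution kernel.
    """
    k = np.arange(L)                                 # positions 0 .. L-1
    P = Lambda[:, None] * delta * k[None, :]         # (N, L): P[n, k] = lambda_n * delta * k
    P = P - P.real.max(axis=1, keepdims=True)        # shift by max real part before exp
    E = np.exp(P)
    S = E / E.sum(axis=1, keepdims=True)             # row-wise (complex) softmax over positions
    return (w @ S).real                              # (L,) real kernel
```

Because each row of the softmax sums to 1, the kernel entries sum to Re(Σ_n w_n); the output of the layer is then obtained by convolving the input sequence with this kernel (typically via FFT).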
Open Source Code | Yes | "Our code is available at https://github.com/ag1988/dss."
Open Datasets | Yes | "We evaluate the performance of DSS on Long Range Arena (LRA) which is a suite of sequence-level classification tasks with diverse input lengths (1K-16K)..."
Dataset Splits | Yes | "Long Range Arena (LRA) [TDA+21] is a standard benchmark for assessing the ability of models to process long sequences."
Hardware Specification | No | The experiments were conducted on IBM's Cognitive Computing Cluster, with additional resources from Tel Aviv University. The paper's checklist (3d) states that resources are specified in Appendix A.3, but A.3 is not included in the main text.
Software Dependencies | No | The paper mentions a "PyTorch implementation" but does not provide specific version numbers for PyTorch or any other software dependencies in the main text. Details might be in Appendix A.3, which is not provided.
Experiment Setup | Yes | "The real and imaginary parts of each element of W are initialized from N(0, 1). Each element of ∆ = exp(∆log) is initialized as e^r where r ∼ U(log(.001), log(.1)). Λ ∈ C^N is initialized using eigenvalues of the normal part of the normal-plus-low-rank form of the HiPPO matrix [GGR22]. Concretely, Λre, Λim are initialized such that the resulting Λ is the vector of those N eigenvalues of the following 2N × 2N matrix which have a positive imaginary part. In all our experiments, we used the above initialization with N = 64. The initial learning rate of all DSS parameters was 10^-3 and weight decay was not applied to them."
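The initialization quoted above can be sketched for a single DSS channel as follows. The structure of the 2N × 2N normal part of the HiPPO-LegS matrix (−1/2·I plus a skew-symmetric matrix built from q_n = √(2n+1)) is taken from the S4/S4D literature and is an assumption of this sketch, since the row quotes the matrix only as "the following 2N × 2N matrix":

```python
import numpy as np

def hippo_normal_eigenvalues(N):
    """Eigenvalues with positive imaginary part of the 2N x 2N normal
    part of the HiPPO-LegS matrix (assumed structure: -1/2*I plus a
    skew-symmetric matrix built from q_n = sqrt(2n+1))."""
    q = np.sqrt(2 * np.arange(2 * N) + 1)
    outer = np.outer(q, q)
    A = 0.5 * np.triu(outer, 1) - 0.5 * np.tril(outer, -1) - 0.5 * np.eye(2 * N)
    eigs = np.linalg.eigvals(A)                      # conjugate pairs, real part -1/2
    return eigs[eigs.imag > 0]                       # keep the N representatives

def init_dss_params(N=64, seed=0):
    """Sketch of the quoted initialization for one DSS channel:
    w's real/imag parts ~ N(0, 1); Delta = e^r with r ~ U(log .001, log .1);
    Lambda from the HiPPO normal-part eigenvalues above."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(N) + 1j * rng.standard_normal(N)
    delta = np.exp(rng.uniform(np.log(0.001), np.log(0.1)))
    Lam = hippo_normal_eigenvalues(N)
    return Lam, w, delta
```

With this construction every eigenvalue has real part −1/2, so the discretized modes decay rather than blow up, which is the motivation for initializing Λ from the normal part of HiPPO.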