On the Parameterization and Initialization of Diagonal State Space Models

Authors: Albert Gu, Karan Goel, Ankit Gupta, Christopher Ré

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental study shows that S4D has strong performance in a wide variety of domains and tasks, including the well-studied Long Range Arena (LRA) benchmark, where the best S4D variant is competitive with S4 on all tasks and significantly outperforms all non-SSM baselines. We begin with controlled ablations of the various representations of diagonal state space models. Sections 5.1 and 5.2 ablate the proposed methods for parameterizing, computing, and initializing diagonal SSMs from Sections 3 and 4. Section 5.3 shows full results of larger models on standard benchmarks.
Researcher Affiliation | Collaboration | Department of Computer Science, Stanford University; IBM Research. {albertgu,knrg}@stanford.edu, chrismre@cs.stanford.edu, ankitgupta.iitkanpur@gmail.com
Pseudocode | No | The paper mentions that the kernel computation 'requires just 2 lines of code' but does not provide a formal pseudocode block or algorithm listing.
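For context on what the quoted '2 lines of code' refers to: the S4D convolution kernel reduces to a Vandermonde matrix product over the diagonal state matrix. The sketch below is our own minimal NumPy illustration of that computation, assuming zero-order-hold discretization; the function name and argument conventions are ours, not from the paper or this report.

```python
import numpy as np

def s4d_kernel(A, B, C, dt, L):
    """Minimal sketch of a diagonal-SSM convolution kernel.

    A: (N,) complex diagonal state matrix (negative real parts for stability)
    B, C: (N,) complex input/output projection vectors
    dt: discretization step size
    L: kernel length
    Returns the real-valued kernel of shape (L,).
    """
    dtA = dt * A                                   # (N,) discretized poles
    # Zero-order-hold discretization of B, folded into the output vector
    C = C * B * (np.exp(dtA) - 1.0) / A            # (N,)
    # Vandermonde matrix V[n, l] = exp(dtA[n])**l, contracted over states n
    V = np.exp(dtA[:, None] * np.arange(L))        # (N, L)
    return 2 * (C @ V).real                        # (L,)
```

The core computation is indeed the last two lines: building the Vandermonde matrix and contracting it with the (folded) output vector.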
Open Source Code | Yes | The code is a simple modification of the original S4 [9] repository and is publicly available.
Open Datasets | Yes | We focus on three datasets covering a varied range of data modalities (image pixels, biosignal time series, audio waveforms), sequence lengths (1K, 4K, 16K), and tasks (classification and regression with bidirectional and causal models). Sequential CIFAR (sCIFAR): CIFAR-10 images are flattened into a sequence of length 1024, and a bidirectional sequence model is used to perform 10-way classification. BIDMC Vital Signs: EKG and PPG signals of length 4000 are used to predict respiratory rate (RR), heart rate (HR), and blood oxygen saturation (SpO2). Speech Commands (SC): a 1-second raw audio waveform comprising 16000 samples is used for 35-way spoken word classification.
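The sCIFAR flattening described above amounts to collapsing the two spatial dimensions of each image into a single sequence axis. A minimal sketch, assuming a channel-last (32, 32, 3) image layout:

```python
import numpy as np

# A CIFAR-10 image is 32x32 pixels with 3 color channels; flattening the
# spatial dimensions yields a length-1024 sequence of 3-dimensional "pixels".
image = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
sequence = image.reshape(32 * 32, 3)  # shape (1024, 3)
assert sequence.shape == (1024, 3)
```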
Dataset Splits | No | The paper states 'We report the validation accuracy after 1 epoch of training on sCIFAR and SC' and refers to Appendix B for the 'full protocol', but the provided text does not explicitly detail the training/validation/test dataset splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper mentions 'modern parallelizable hardware such as GPUs' generally, and states that 'resource and timing information' is reported with the experiment code, but does not provide specific GPU models, CPU types, or other detailed hardware specifications in the main text provided.
Software Dependencies | No | The paper mentions 'deep learning frameworks such as PyTorch' but does not specify exact version numbers for any software dependencies.
Experiment Setup | Yes | We fix a simple architecture and training protocol that works generically. The architecture has 4 layers and hidden dimension H = 128, resulting in 100K parameters. All results are averaged over multiple seeds (full protocol and results including std. reported in Appendix B).