Efficiently Modeling Long Sequences with Structured State Spaces

Authors: Albert Gu, Karan Goel, Christopher Ré

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet; (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation 60× faster; (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.
Researcher Affiliation | Academia | Albert Gu, Karan Goel & Christopher Ré, Department of Computer Science, Stanford University. {albertgu,krng}@stanford.edu, chrismre@cs.stanford.edu
Pseudocode | Yes | Algorithm 1 S4 CONVOLUTION KERNEL (SKETCH)
Open Source Code | Yes | Code is publicly available at https://github.com/HazyResearch/state-spaces.
Open Datasets | Yes | S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10... (iii) SoTA on every task from the Long Range Arena benchmark... On CIFAR-10 density estimation... On WikiText-103 language modeling... Speech Commands dataset (Warden, 2018).
Dataset Splits | Yes | LRA (Tay et al., 2021) contains 6 tasks with lengths 1K-16K steps... Table 9: The values of the best hyperparameters found for classification datasets; LRA (Top) and images/speech (Bottom)... Evaluation was performed similarly to the basic setting in Baevski & Auli (2018), Table 5, which involves sliding non-overlapping windows of width 1024 tokens.
Hardware Specification | Yes | Benchmarking results from Table 1 and Table 2 were tested on a single A100 GPU. Some tasks used an A100 GPU (notably, the Path-X experiments), which has a larger max memory of 40GB.
Software Dependencies | No | Our current implementation of S4 actually uses the naive O(NL) algorithm... we leverage the pykeops library for memory-efficient kernel operations. This code was only available in TensorFlow... which were implemented in PyTorch.
Experiment Setup | Yes | Table 9: The values of the best hyperparameters found for classification datasets; LRA (Top) and images/speech (Bottom). LR is learning rate and WD is weight decay. BN and LN refer to Batch Normalization and Layer Normalization.
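
The sketches below illustrate several entries from the table. For the Pseudocode entry: the paper's Algorithm 1 computes the S4 kernel via a Cauchy-kernel reduction; as a point of reference, here is the naive SSM kernel computation that Algorithm 1 is designed to avoid, in plain PyTorch (function names are ours; A, B, C are assumed already discretized):

```python
import torch

def naive_ssm_kernel(A, B, C, L):
    # Materialize K = (CB, CAB, CA^2B, ..., CA^{L-1}B) by stepping the
    # state map forward one index at a time: O(N^2 L) time, reference only.
    # Shapes assumed: A (N, N), B (N, 1), C (1, N), all already discretized.
    K, x = [], B
    for _ in range(L):
        K.append((C @ x).item())  # scalar kernel tap C A^l B
        x = A @ x                 # advance the implicit state
    return torch.tensor(K)        # (L,)

def causal_conv(u, K):
    # y = K * u as a causal convolution in O(L log L) via FFT,
    # zero-padded to length 2L to avoid circular wrap-around.
    L = u.shape[-1]
    prod = torch.fft.rfft(K, n=2 * L) * torch.fft.rfft(u, n=2 * L)
    return torch.fft.irfft(prod, n=2 * L)[..., :L]
```

In S4 itself, A carries the HiPPO (NPLR) structure and the kernel is evaluated in frequency space, so this O(N²L) loop never runs at full scale.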
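For the Open Datasets entry: of the benchmarks listed, sequential CIFAR-10 is the only one with a nonstandard presentation, since each image is consumed as a pixel sequence. A minimal torchvision sketch of that preprocessing; the normalization constants are the usual CIFAR-10 statistics, an assumption on our part, as the excerpt does not specify them:

```python
from torchvision import datasets, transforms

# Flatten each 32x32 RGB image into a length-1024 sequence of
# 3-dimensional "pixels"; no augmentation, matching the quoted
# "no data augmentation or auxiliary losses" setting.
to_sequence = transforms.Compose([
    transforms.ToTensor(),                          # (3, 32, 32)
    transforms.Normalize((0.4914, 0.4822, 0.4465),  # assumed standard
                         (0.2470, 0.2435, 0.2616)), # CIFAR-10 stats
    transforms.Lambda(lambda x: x.reshape(3, -1).T) # (1024, 3)
])

train_set = datasets.CIFAR10("data/", train=True, download=True,
                             transform=to_sequence)
```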
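For the Dataset Splits entry: the WikiText-103 evaluation protocol quoted above (sliding non-overlapping windows of width 1024 tokens) is simple to restate in code. `score_window` below is a hypothetical stand-in for whatever negative log-likelihood the evaluation harness computes:

```python
import torch

def nonoverlapping_windows(tokens, width=1024):
    # Split a 1-D token stream into consecutive, non-overlapping windows,
    # as in the Baevski & Auli (2018) setting quoted above. The trailing
    # remainder is dropped here; the excerpt does not say how it is handled.
    n = tokens.numel() // width
    return tokens[: n * width].view(n, width)

# usage (score_window is hypothetical):
# nll = torch.stack([score_window(w) for w in nonoverlapping_windows(test_tokens)])
# perplexity = nll.mean().exp()
```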
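For the Hardware Specification entry: since the Path-X runs specifically needed the 40GB A100, a one-line sanity check with the standard PyTorch API before attempting to reproduce them:

```python
import torch

# Path-X reportedly needed the 40GB A100; check what this machine offers.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 2**30:.0f} GiB")
```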
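For the Software Dependencies entry: pykeops is used to evaluate a Cauchy-style reduction without materializing intermediates. A sketch of the naive O(NL)-memory computation that the fused pykeops path replaces (names and shapes are our assumptions):

```python
import torch

def cauchy_naive(w, lam, z):
    # sum_n w_n / (z_j - lam_n) for every node z_j, materializing the
    # full (L, N) matrix of pairwise differences -- the memory cost the
    # repository's pykeops LazyTensor path avoids by fusing the reduction.
    # Shapes assumed: w, lam (N,) complex; z (L,) complex.
    return (w[None, :] / (z[:, None] - lam[None, :])).sum(dim=-1)  # (L,)
```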
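For the Experiment Setup entry: the Table 9 columns (LR, WD, BN/LN) map onto standard PyTorch components. The values and helper names below are placeholders of ours, not the paper's tuned settings:

```python
import torch
import torch.nn as nn

def make_optimizer(model, lr=1e-2, wd=0.01):
    # LR / WD taken from a Table 9 row; the defaults here are placeholders.
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)

def make_norm(kind, d_model):
    # Table 9's BN / LN choice; BatchNorm1d assumes (batch, d_model, length)
    # layout while LayerNorm assumes features-last -- an assumed convention.
    return nn.BatchNorm1d(d_model) if kind == "BN" else nn.LayerNorm(d_model)
```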