Efficiently Modeling Long Sequences with Structured State Spaces
Authors: Albert Gu, Karan Goel, Christopher Ré
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet, (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation 60× faster, (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors. |
| Researcher Affiliation | Academia | Albert Gu & Karan Goel & Christopher Ré, Department of Computer Science, Stanford University. {albertgu,krng}@stanford.edu, chrismre@cs.stanford.edu |
| Pseudocode | Yes | Algorithm 1 S4 CONVOLUTION KERNEL (SKETCH) |
| Open Source Code | Yes | Code is publicly available at https://github.com/HazyResearch/state-spaces. |
| Open Datasets | Yes | S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10... (iii) SoTA on every task from the Long Range Arena benchmark... On CIFAR-10 density estimation... On WikiText-103 language modeling... Speech Commands dataset (Warden, 2018). |
| Dataset Splits | Yes | LRA (Tay et al., 2021) contains 6 tasks with lengths 1K-16K steps... Table 9: The values of the best hyperparameters found for classification datasets; LRA (Top) and images/speech (Bottom)... Evaluation was performed similarly to the basic setting in (Baevski & Auli, 2018), Table 5, which involves sliding non-overlapping windows of width 1024 tokens. |
| Hardware Specification | Yes | Benchmarking results from Table 1 and Table 2 were tested on a single A100 GPU. Some tasks used an A100 GPU (notably, the Path-X experiments), which has a larger max memory of 40GB. |
| Software Dependencies | No | Our current implementation of S4 actually uses the naive O(NL) algorithm... we leverage the pykeops library for memory-efficient kernel operations. This code was only available in TensorFlow... which were implemented in PyTorch. |
| Experiment Setup | Yes | Table 9: The values of the best hyperparameters found for classification datasets; LRA (Top) and images/speech (Bottom). LR is learning rate and WD is weight decay. BN and LN refer to Batch Normalization and Layer Normalization. |
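The table above quotes the paper's "naive O(NL) algorithm" for materializing the SSM convolution kernel. As context for that evidence, the following is a minimal NumPy sketch of that naive kernel computation and its equivalence to the recurrent view; the function names and matrix shapes are illustrative assumptions, not the authors' released code, and this omits the paper's actual fast Cauchy-kernel algorithm (Algorithm 1) entirely.

```python
import numpy as np

def ssm_convolution_kernel(A, B, C, L):
    """Naively materialize the length-L SSM convolution kernel
    K = (CB, CAB, CA^2B, ..., CA^{L-1}B).
    A: (N, N) state matrix, B: (N, 1) input map, C: (1, N) output map.
    Cost is O(N^2 L) here; S4's Algorithm 1 avoids this blowup."""
    kernel = np.empty(L)
    x = B  # holds A^l B, starting at l = 0
    for l in range(L):
        kernel[l] = (C @ x).item()
        x = A @ x
    return kernel

def ssm_convolve(u, A, B, C):
    """Apply the state-space model to input sequence u as a causal
    convolution with the materialized kernel K."""
    L = len(u)
    K = ssm_convolution_kernel(A, B, C, L)
    # y[t] = sum_{l <= t} K[l] * u[t - l]; truncate to input length.
    return np.convolve(u, K)[:L]
```

As a sanity check, this convolutional output matches unrolling the recurrence x_k = A x_{k-1} + B u_k, y_k = C x_k step by step, which is the equivalence the S4 paper exploits to switch between training (convolution) and generation (recurrence) modes.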