Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers

Authors: Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, Christopher Ré

NeurIPS 2021

Reproducibility assessment: each variable below lists the result and the supporting LLM response.

Research Type: Experimental
LLM Response: "Empirically, stacking LSSL layers into a simple deep neural network obtains state-of-the-art results across time series benchmarks for long dependencies in sequential image classification, real-world healthcare regression tasks, and speech. On a difficult speech classification task with length-16000 sequences, LSSL outperforms prior approaches by 24 accuracy points, and even outperforms baselines that use handcrafted features on 100x shorter sequences." (A minimal sketch of the recurrence a single LSSL layer computes appears after this table.)

Researcher Affiliation: Academia
LLM Response: "Department of Computer Science, Stanford University; Department of Electrical Engineering, Stanford University; Department of Computer Science and Engineering, University at Buffalo, SUNY. {albertgu,knrg,ksaab,trid}@stanford.edu, chrismre@cs.stanford.edu, {isysjohn,atri}@buffalo.edu"

Pseudocode: No
LLM Response: The paper does not contain any clearly labeled pseudocode or algorithm blocks.

Open Source Code: No
LLM Response: The paper does not include an explicit statement about releasing source code or provide a link to a code repository for the described methodology.

Open Datasets: Yes
LLM Response: "We test on the sequential MNIST, permuted MNIST, and sequential CIFAR tasks (Table 1), popular benchmarks which were originally designed to test the ability of recurrent models to capture long-term dependencies of length up to 1k [2]. We additionally use the BIDMC healthcare datasets (Table 2), a suite of widely studied time series regression problems of length 4000 on estimating vital signs. Table 4 reports results for the Speech Commands (SC) dataset [31] for classification of 1-second audio clips. We create a challenging new sequential CelebA task, where we classify 178 x 218 images as 38,000-length sequences for 4 facial attributes: Attractive (Att.), Mouth Slightly Open (MSO), Smiling (Smil.), Wearing Lipstick (WL) [36]." (A sketch of how such pixel-sequence benchmarks are typically constructed appears after this table.)

Dataset Splits: No
LLM Response: While the paper names the benchmark datasets it uses, it does not explicitly provide the train/validation/test splits (e.g., percentages or sample counts) needed for reproducibility.

Hardware Specification: No
LLM Response: The paper mentions 'multi-GPU training' and 'Google Cloud credits' but does not specify any particular GPU models, CPU types, or other detailed hardware specifications used for the experiments.

Software Dependencies: No
LLM Response: The paper does not provide specific software names with version numbers (e.g., Python 3.8, PyTorch 1.9) needed for reproducibility.

Experiment Setup: No
LLM Response: The paper states that 'Full architecture details are described in Appendix B, including the initialization of A and Δt, computational details, and other architectural details' and that 'we did light tuning primarily on learning rate and dropout', but it does not provide specific hyperparameter values or detailed training configurations in the main text.
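
Since the paper contains no pseudocode, the following is a minimal illustrative sketch of what a single Linear State Space Layer computes: the continuous state space model x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t), discretized with the bilinear method and unrolled as a linear recurrence. Everything here is an assumption for illustration, not the authors' code; the actual LSSL uses a HiPPO-based initialization of A, learned timescales Δt, and an equivalent convolutional view for efficiency, none of which is reproduced below.

```python
import numpy as np

def discretize(A, B, dt):
    """Bilinear (Tustin) discretization of the continuous pair (A, B)."""
    n = A.shape[0]
    I = np.eye(n)
    inv = np.linalg.inv(I - (dt / 2.0) * A)
    A_bar = inv @ (I + (dt / 2.0) * A)   # discrete state matrix
    B_bar = inv @ (dt * B)               # discrete input matrix
    return A_bar, B_bar

def lssl_forward(u, A, B, C, D, dt):
    """Unroll x_k = A_bar x_{k-1} + B_bar u_k, y_k = C x_k + D u_k
    over a 1-D input sequence u (the recurrent view of the layer)."""
    A_bar, B_bar = discretize(A, B, dt)
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A_bar @ x + B_bar.ravel() * u_k
        ys.append(C @ x + D * u_k)
    return np.array(ys)

# Toy usage with random (not HiPPO-initialized) parameters.
rng = np.random.default_rng(0)
N = 4                                               # state size
A = 0.1 * rng.standard_normal((N, N)) - np.eye(N)   # roughly stable A
B = rng.standard_normal((N, 1))
C = rng.standard_normal(N)
y = lssl_forward(rng.standard_normal(16), A, B, C, D=0.0, dt=0.1)
print(y.shape)  # (16,)
```

A deep model then stacks such layers (one SSM per channel) with nonlinearities and normalization in between, matching the quoted "stacking LSSL layers into a simple deep neural network".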
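
For the quoted pixel-level benchmarks, here is a hedged sketch of how sequential MNIST is commonly constructed: each 28 x 28 image is flattened into a length-784 pixel sequence, and permuted MNIST applies one fixed random permutation to those positions. The torchvision calls are standard, but the exact transform pipeline is our assumption rather than the paper's released preprocessing.

```python
import torch
from torchvision import datasets, transforms

# Flatten each 28x28 grayscale image into a length-784 pixel sequence.
to_sequence = transforms.Compose([
    transforms.ToTensor(),                       # (1, 28, 28), values in [0, 1]
    transforms.Lambda(lambda x: x.view(-1, 1)),  # (784, 1) sequence of pixels
])

train_set = datasets.MNIST(root="data", train=True, download=True,
                           transform=to_sequence)
seq, label = train_set[0]
print(seq.shape)  # torch.Size([784, 1])

# Permuted MNIST: apply one fixed random permutation to the 784 steps.
perm = torch.randperm(784)
permuted_seq = seq[perm]
```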