S4ND: Modeling Images and Videos as Multidimensional Signals with State Spaces

Authors: Eric Nguyen, Karan Goel, Albert Gu, Gordon Downs, Preey Shah, Tri Dao, Stephen Baccus, Christopher Ré

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that S4ND can model large-scale visual data in 1D, 2D, and 3D as continuous multidimensional signals and demonstrate strong performance by simply swapping Conv2D and self-attention layers with S4ND layers in existing state-of-the-art models. On ImageNet-1k, S4ND exceeds the performance of a Vision Transformer baseline by 1.5% when training with a 1D sequence of patches, and matches ConvNeXt when modeling images in 2D. For videos, S4ND improves on an inflated 3D ConvNeXt in activity classification on HMDB-51 by 4%. (See the layer-swap sketch below the table.)
Researcher Affiliation | Academia | Department of Bioengineering, Stanford University; Department of Computer Science, Stanford University; Department of Neurobiology, Stanford University. {etnguyen,albertgu,gwdowns,preey,trid,baccus}@stanford.edu, {kgoel,chrismre}@cs.stanford.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The main method can be implemented based on the released S4 repository, and we will publicly release our code. All hyperparameters and experimental details are reported in the Appendix.
Open Datasets | Yes | On ImageNet-1k, S4ND exceeds the performance of a Vision Transformer baseline by 1.5% when training with a 1D sequence of patches... activity classification on HMDB-51... On the standard CIFAR-10 [36] and Celeb-A [43] datasets...
Dataset Splits | Yes | For all ImageNet models, we train from scratch with no outside data and adopt the training procedure from [62, 69], which uses the AdamW optimizer [44] for 300 epochs... For CIFAR-10, we train with a low → base, 80/20 epoch schedule, and perform within 1% of an S4ND model trained with base resolution data while speeding up training by 21.8%. (See the progressive-resizing sketch below the table.)
Hardware Specification | No | The paper states 'Reported in the Appendix' regarding hardware, but specific hardware details such as GPU/CPU models are not present in the provided main text.
Software Dependencies | No | The paper mentions software components such as the AdamW optimizer, RandAugment, Mixup, AugMix, and PyTorch image models, but does not provide version numbers for these components or for any programming language.
Experiment Setup | Yes | For all ImageNet models, we train from scratch with no outside data and adopt the training procedure from [62, 69], which uses the AdamW optimizer [44] for 300 epochs, cosine decay learning rate, weight decay 0.05... The initial learning rate for ViT (and S4ND-ViT) is 0.001, while for ConvNeXt (and S4ND-ConvNeXt) it is 0.004. (See the optimizer sketch below the table.)
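To illustrate the layer swap described in the Research Type row, below is a minimal sketch, not the authors' implementation (that lives in the released S4 repository). It shows a depthwise, global 2D convolution whose kernel is the outer product of two 1D kernels, applied with FFTs. The real S4ND parameterizes each 1D kernel with a state-space model (A, B, C); here the 1D kernels are free parameters, which is only a stand-in for that parameterization, and the class name GlobalConv2DSketch is hypothetical.

```python
# Sketch only: free 1D kernels stand in for S4ND's state-space kernels.
import torch
import torch.nn as nn
import torch.fft


class GlobalConv2DSketch(nn.Module):
    """Depthwise global 2D convolution with a separable (outer-product) kernel."""

    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # One 1D kernel per spatial axis, per channel (stand-in for SSM kernels).
        self.k_h = nn.Parameter(torch.randn(channels, height) * 0.02)
        self.k_w = nn.Parameter(torch.randn(channels, width) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        B, C, H, W = x.shape
        # Outer product -> a full-size 2D kernel, one per channel.
        k2d = self.k_h.unsqueeze(-1) * self.k_w.unsqueeze(-2)  # (C, H, W)
        # Circular convolution via 2D FFT (a linear conv would pad first).
        Xf = torch.fft.rfft2(x, s=(H, W))
        Kf = torch.fft.rfft2(k2d, s=(H, W))
        return torch.fft.irfft2(Xf * Kf, s=(H, W))


# Usage: this layer could replace a 7x7 depthwise Conv2d in a ConvNeXt-style
# block; the rest of the block (LayerNorm, MLP) would be unchanged.
x = torch.randn(2, 64, 32, 32)
layer = GlobalConv2DSketch(channels=64, height=32, width=32)
print(layer(x).shape)  # torch.Size([2, 64, 32, 32])
```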
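The Dataset Splits row quotes a "low → base, 80/20 epoch schedule" for CIFAR-10: train the first 80% of epochs at a lower resolution, then the final 20% at the base resolution (S4ND's continuous kernels allow resolution changes without swapping layers). The sketch below assumes an illustrative total epoch count and a 16x16 low resolution; neither is stated in the quoted text.

```python
# A minimal sketch (assumed values, not the authors' script) of the
# low -> base, 80/20 progressive-resizing schedule for CIFAR-10.
import torch
import torch.nn.functional as F

total_epochs = 100                     # illustrative; not from the paper
low_epochs = int(0.8 * total_epochs)   # epochs trained at low resolution
base_res, low_res = 32, 16             # CIFAR-10 base is 32x32; 16 is assumed

def resolution_for_epoch(epoch: int) -> int:
    return low_res if epoch < low_epochs else base_res

# Inside the training loop, resize each batch to the scheduled resolution.
batch = torch.randn(8, 3, base_res, base_res)   # dummy CIFAR-sized batch
for epoch in [0, low_epochs - 1, low_epochs, total_epochs - 1]:
    res = resolution_for_epoch(epoch)
    x = F.interpolate(batch, size=(res, res), mode="bilinear",
                      align_corners=False)
    print(epoch, x.shape)  # resolution switches at the 80% mark
```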
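Finally, a minimal sketch of the reported ImageNet optimization recipe from the Experiment Setup row: AdamW for 300 epochs with cosine learning-rate decay and weight decay 0.05, with an initial learning rate of 0.001 for (S4ND-)ViT and 0.004 for (S4ND-)ConvNeXt. The model here is a placeholder, and any warmup or per-batch scheduling details are omitted since they are not in the quoted text.

```python
# A minimal sketch of the reported recipe; the backbone is a placeholder.
import torch
import torch.nn as nn

model = nn.Linear(768, 1000)  # stand-in for the actual S4ND backbone

epochs = 300
base_lr = 0.001  # 0.004 for the ConvNeXt variants

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr,
                              weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                       T_max=epochs)

for epoch in range(epochs):
    # ... one epoch of training (forward, loss, backward) would go here ...
    optimizer.step()   # placeholder for the per-batch update
    scheduler.step()   # cosine decay stepped once per epoch
```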