S4ND: Modeling Images and Videos as Multidimensional Signals with State Spaces
Authors: Eric Nguyen, Karan Goel, Albert Gu, Gordon Downs, Preey Shah, Tri Dao, Stephen Baccus, Christopher Ré
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that S4ND can model large-scale visual data in 1D, 2D, and 3D as continuous multidimensional signals and demonstrate strong performance by simply swapping Conv2D and self-attention layers with S4ND layers in existing state-of-the-art models (see the layer-swap sketch below the table). On ImageNet-1k, S4ND exceeds the performance of a Vision Transformer baseline by 1.5% when training with a 1D sequence of patches, and matches ConvNeXt when modeling images in 2D. For videos, S4ND improves on an inflated 3D ConvNeXt in activity classification on HMDB-51 by 4%. |
| Researcher Affiliation | Academia | Department of Bioengineering, Stanford University; Department of Computer Science, Stanford University; Department of Neurobiology, Stanford University. {etnguyen,albertgu,gwdowns,preey,trid,baccus}@stanford.edu; {kgoel,chrismre}@cs.stanford.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The main method can be implemented based on the released S4 repository, and we will publicly release our code. All hyperparameters and experimental details are reported in the Appendix. |
| Open Datasets | Yes | On ImageNet-1k, S4ND exceeds the performance of a Vision Transformer baseline by 1.5% when training with a 1D sequence of patches... activity classification on HMDB-51... On the standard CIFAR-10 [36] and Celeb-A [43] datasets... |
| Dataset Splits | Yes | For all ImageNet models, we train from scratch with no outside data and adopt the training procedure from [62, 69], which uses the AdamW optimizer [44] for 300 epochs... For CIFAR-10, we train with a low → base, 80/20 epoch schedule, and perform within 1% of an S4ND model trained with base resolution data while speeding up training by 21.8% (see the resolution-schedule sketch below the table). |
| Hardware Specification | No | The paper defers hardware details to the Appendix ('Reported in the Appendix'), but no specific GPU/CPU models appear in the main text provided. |
| Software Dependencies | No | The paper names software components such as the AdamW optimizer, RandAugment, Mixup, AugMix, and PyTorch image models, but gives no version numbers for them or for any programming language. |
| Experiment Setup | Yes | For all ImageNet models, we train from scratch with no outside data and adopt the training procedure from [62, 69], which uses the AdamW optimizer [44] for 300 epochs, cosine decay learning rate, weight decay 0.05... The initial learning rate for ViT (and S4ND-ViT) is 0.001, while for ConvNeXt (and S4ND-ConvNeXt) it is 0.004 (see the optimizer sketch below the table). |
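
To make the layer swap in the Research Type row concrete, the sketch below recursively replaces every `nn.Conv2d` in a PyTorch model with a module built by a user-supplied factory. The `make_s4nd_layer` factory is a hypothetical stand-in: a real run would construct an S4ND layer from the authors' released S4 repository, whose API this sketch does not assume.

```python
import torch.nn as nn

def swap_conv2d_layers(model: nn.Module, make_s4nd_layer) -> nn.Module:
    """Recursively replace each nn.Conv2d in `model` with the module
    returned by `make_s4nd_layer(conv)`.

    `make_s4nd_layer` is a hypothetical factory; in practice it would
    wrap an S4ND layer from the authors' S4 repository.
    """
    for name, child in model.named_children():
        if isinstance(child, nn.Conv2d):
            setattr(model, name, make_s4nd_layer(child))
        else:
            swap_conv2d_layers(child, make_s4nd_layer)
    return model
```

The same pattern covers the ViT case: change the `isinstance` check to the self-attention block type and supply a factory that builds a 1D S4ND layer instead.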
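
The low → base, 80/20 epoch schedule quoted in the Dataset Splits row (train on downsampled images for the first 80% of epochs, then at base resolution) can be sketched with standard torchvision transforms. The epoch count and the 16/32 resolutions here are illustrative assumptions; the excerpt does not state them.

```python
from torchvision import transforms

EPOCHS = 100                 # assumed; the excerpt omits the CIFAR-10 epoch budget
LOW_RES, BASE_RES = 16, 32   # assumed low/base resolutions for CIFAR-10

def transform_for_epoch(epoch: int) -> transforms.Compose:
    """Low -> base, 80/20 schedule: low resolution for the first 80%
    of epochs, base resolution for the remaining 20%."""
    res = LOW_RES if epoch < 0.8 * EPOCHS else BASE_RES
    return transforms.Compose([
        transforms.Resize(res),
        transforms.ToTensor(),
    ])
```

Because S4ND parameterizes images as continuous signals, the same weights can be trained at both resolutions, which is what makes this schedule possible without architectural changes.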
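
The Experiment Setup row pins down enough of the ImageNet recipe (AdamW, 300 epochs, cosine learning-rate decay, weight decay 0.05, per-model initial learning rates) for a minimal sketch using stock PyTorch. The stand-in model and the omission of warmup and augmentation details (which the paper leaves to its appendix) are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in; the paper trains S4ND-ViT / S4ND-ConvNeXt

# Initial LR from the excerpt: 0.001 for (S4ND-)ViT, 0.004 for (S4ND-)ConvNeXt.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

# Cosine decay over the 300-epoch budget; warmup and augmentations
# (RandAugment, Mixup, AugMix) are omitted here.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    # ... one training epoch over ImageNet-1k would run here ...
    scheduler.step()
```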