Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
S4ND: Modeling Images and Videos as Multidimensional Signals with State Spaces
Authors: Eric Nguyen, Karan Goel, Albert Gu, Gordon Downs, Preey Shah, Tri Dao, Stephen Baccus, Christopher Ré
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that S4ND can model large-scale visual data in 1D, 2D, and 3D as continuous multidimensional signals and demonstrates strong performance by simply swapping Conv2D and self-attention layers with S4ND layers in existing state-of-the-art models. On ImageNet-1k, S4ND exceeds the performance of a Vision Transformer baseline by 1.5% when training with a 1D sequence of patches, and matches ConvNeXt when modeling images in 2D. For videos, S4ND improves on an inflated 3D ConvNeXt in activity classification on HMDB-51 by 4%. |
| Researcher Affiliation | Academia | Department of Bio Engineering, Stanford University; Department of Computer Science, Stanford University; Department of Neurobiology, Stanford University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The main method can be implemented based on the released S4 repository, and we will publically release our code. All hyperparameters and experimental details are reported in the Appendix. |
| Open Datasets | Yes | On ImageNet-1k, S4ND exceeds the performance of a Vision Transformer baseline by 1.5% when training with a 1D sequence of patches... activity classification on HMDB-51... On the standard CIFAR-10 [36] and Celeb-A [43] datasets... |
| Dataset Splits | Yes | For all ImageNet models, we train from scratch with no outside data and adopt the training procedure from [62, 69], which uses the AdamW optimizer [44] for 300 epochs... For CIFAR-10, we train with a low → base, 80/20 epoch schedule, and perform within 1% of an S4ND model trained with base resolution data while speeding up training by 21.8%. |
| Hardware Specification | No | The paper states 'Reported in the Appendix' regarding hardware, but specific hardware details such as GPU/CPU models are not present in the provided main text. |
| Software Dependencies | No | The paper mentions software components like 'AdamW optimizer', 'RandAugment', 'Mixup', 'AugMix', and 'PyTorch image models', but does not provide specific version numbers for these or any programming languages. |
| Experiment Setup | Yes | For all ImageNet models, we train from scratch with no outside data and adopt the training procedure from [62, 69], which uses the AdamW optimizer [44] for 300 epochs, cosine decay learning rate, weight decay 0.05... The initial learning rate for ViT (and S4ND-ViT) is 0.001, while for ConvNeXt (and S4ND-ConvNeXt) it is 0.004. |
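
The Experiment Setup row quotes a cosine decay learning rate over 300 epochs with two different initial rates. As a rough illustration of what that schedule looks like, here is a minimal sketch using only the Python standard library. It assumes the rate is updated once per epoch, decays to a floor of zero, and uses no warmup; none of these details are stated in the quoted text, and the function and constant names are hypothetical.

```python
import math

# Values quoted in the Experiment Setup row.
TOTAL_EPOCHS = 300
WEIGHT_DECAY = 0.05  # quoted for reference; not used by the schedule itself
BASE_LR = {"vit": 1e-3, "convnext": 4e-3}


def cosine_lr(epoch: int, base_lr: float,
              total_epochs: int = TOTAL_EPOCHS,
              min_lr: float = 0.0) -> float:
    """Return the cosine-decayed learning rate at a given epoch.

    Assumptions (not taken from the paper): per-epoch updates,
    no warmup phase, and a default floor of min_lr = 0.
    """
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    progress = epoch / total_epochs
    # Half a cosine period: starts at base_lr, ends at min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for name, lr in BASE_LR.items():
        samples = [cosine_lr(e, lr) for e in (0, 150, 300)]
        print(name, samples)
```

Under these assumptions the rate starts at the quoted initial value, passes through half of it at epoch 150, and reaches the floor at epoch 300. An actual training run would typically rely on the scheduler shipped with the training framework rather than a hand-written function.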