A 2-Dimensional State Space Layer for Spatial Inductive Bias

Authors: Ethan Baron, Itamar Zimerman, Lior Wolf

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirically, we observe that incorporating our layer at the beginning of each transformer block of Vision Transformers (ViT), as well as replacing the Conv2D filters of ConvNeXt with our proposed layers, significantly enhances performance for multiple backbones and across multiple datasets. The new layer is effective even with a negligible amount of additional parameters and inference time. Ablation studies and visualizations demonstrate that the layer has a strong 2-D inductive bias."
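The quoted passage describes prepending the new layer to each transformer block. A minimal, framework-free sketch of that wiring is below; the `Block` and `prepend_ssm` names and the residual connection around the SSM layer are illustrative assumptions, not the paper's actual code:

```python
# Hypothetical sketch: placing a 2-D SSM layer at the start of a
# transformer block. Names and the residual add are assumptions.

class Block:
    """Stand-in for one ViT transformer block (attention + MLP)."""
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, x):
        return self.fn(x)

def prepend_ssm(block, ssm_layer):
    """Run the 2-D SSM layer first, then the original block."""
    def wrapped(x):
        return block(x + ssm_layer(x))  # residual keeps the original path
    return Block(wrapped)
```

With a real backbone, the same wrapping would be applied to every block in the stack, which is consistent with the quoted claim that the extra parameter count stays negligible.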
Researcher Affiliation | Academia | "Ethan Baron, Itamar Zimerman & Lior Wolf. The Blavatnik School of Computer Science, Tel Aviv University. {barone,zimerman1}@mail.tau.ac.il, wolf@cs.tau.ac.il"
Pseudocode | No | The paper describes computational steps and formulations, but does not include structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | "Our code is available at this git https URL."
Open Datasets | Yes | "Results using ViT, MEGA and Swin backbones on the Tiny ImageNet (T-IN) and CIFAR-100 (C100) datasets." and "The DeiT and Swin backbones were tested on the large-scale Celeb-A dataset (Liu et al., 2015) and ImageNet-100." and "We also conduct an experiment on the ImageNet-1K dataset (Deng et al., 2009), and as shown in Tab. 5, we improve over MEGA's ViT-T results by 0.4% in both Top-1 accuracy and Top-5 accuracy."
Dataset Splits | Yes | "We tested Swin and ViT over the small datasets Tiny-ImageNet and CIFAR-100, using the results reported by (Lee et al., 2021) as baseline." and "The DeiT and Swin backbones were tested on the large-scale Celeb-A dataset (Liu et al., 2015) and ImageNet-100." and "We examined our model on CIFAR-10 Grayscale in an isotropic manner (without decreasing the image size along the architecture, and without patches), which is part of the Long Range Arena benchmark (Tay et al., 2020)."
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for its experiments.
Software Dependencies | No | "We use PyTorch for all experiments." (No version numbers are provided for PyTorch or other software dependencies.)
Experiment Setup | Yes | "All experiment results were averaged over seeds = [0, 1, 2]. For all datasets and backbones, we set nssm = 8, N = 16 for all SSM-based variants (SSM-2D real & complex and S4ND)... AdamW was used as the optimizer. Weight decay was set to 0.05 (apart from the SSM layer, where it was set to 0), batch size to 128, and warm-up to 10. All models were trained for 100 epochs, and cosine learning rate decay was used. The initial learning rate was set to 0.003."
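The reported schedule can be sketched as a small function: a 10-epoch warm-up followed by cosine decay over 100 total epochs from a base rate of 0.003. The linear shape of the warm-up is an assumption, since the quoted text only states "warm-up to 10":

```python
import math

BASE_LR = 0.003   # initial learning rate reported in the paper
WARMUP = 10       # warm-up length in epochs
EPOCHS = 100      # total training epochs

def lr_at(epoch):
    """Learning rate at a given 0-indexed epoch."""
    if epoch < WARMUP:
        # Linear warm-up (shape assumed; the paper gives only its length).
        return BASE_LR * (epoch + 1) / WARMUP
    # Cosine decay from BASE_LR toward 0 over the remaining epochs.
    t = (epoch - WARMUP) / (EPOCHS - WARMUP)
    return 0.5 * BASE_LR * (1 + math.cos(math.pi * t))
```

In PyTorch this would typically be realized with `torch.optim.AdamW` (weight decay 0.05, zeroed for the SSM parameters via separate parameter groups) plus a cosine annealing scheduler.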