Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

Authors: Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, Xinggang Wang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct extensive experiments on ImageNet classification and dense prediction downstream tasks. The results demonstrate that Vim achieves superior performance compared to the well-established and highly-optimized plain vision Transformer, i.e., DeiT."
Researcher Affiliation | Collaboration | 1) School of EIC, Huazhong University of Science & Technology; 2) Institute of Artificial Intelligence, Huazhong University of Science & Technology; 3) Horizon Robotics; 4) Beijing Academy of Artificial Intelligence.
Pseudocode | Yes | "Specifically, we present the operations of Vim block in Algo. 1." (A hedged sketch of this block appears below the table.)
Open Source Code | Yes | "Code and models are released at https://github.com/hustvl/Vim."
Open Datasets | Yes | "We benchmark Vim on the ImageNet-1K dataset (Deng et al., 2009), which contains 1.28M training images and 50K validation images from 1,000 categories."
Dataset Splits | Yes | "We benchmark Vim on the ImageNet-1K dataset (Deng et al., 2009), which contains 1.28M training images and 50K validation images from 1,000 categories. All models are trained on the training set, and top-1 accuracy on the validation set is reported." (A loading sketch appears below the table.)
Hardware Specification | Yes | "Experiments are performed on 8 A800 GPUs."
Software Dependencies | No | The paper mentions software such as the AdamW optimizer and the UperNet and ViTDet frameworks, but does not provide version numbers for these or for other key dependencies (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | "Specifically, we apply random cropping, random horizontal flipping, label-smoothing regularization, mixup, and random erasing as data augmentations. When training on 224² input images, we employ AdamW (Loshchilov & Hutter, 2019) with a momentum of 0.9, a total batch size of 1024, and a weight decay of 0.05 to optimize models. We train the Vim models for 300 epochs using a cosine schedule, a 1×10⁻³ initial learning rate, and EMA." (A recipe sketch appears below the table.)
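
To make the pseudocode entry concrete, below is a minimal PyTorch sketch of a Vim-style bidirectional SSM block in the spirit of Algo. 1. It is not the released implementation: the selective scan is written as a naive Python loop (the official code uses a hardware-aware kernel), and the module names, state size, and initialization here are illustrative assumptions. See https://github.com/hustvl/Vim for the authors' code.

```python
# Illustrative sketch of a bidirectional SSM ("Vim-style") block.
# NOT the official implementation; names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def naive_ssm_scan(x, A, B, C, dt):
    """Sequential SSM scan: h_t = exp(dt_t*A)*h_{t-1} + dt_t*B_t*x_t; y_t = <C_t, h_t>.

    Shapes: x (batch, len, dim); A (dim, state); B, C (batch, len, state); dt (batch, len, dim).
    """
    bsz, length, dim = x.shape
    h = x.new_zeros(bsz, dim, A.shape[-1])
    outs = []
    for t in range(length):
        dA = torch.exp(dt[:, t].unsqueeze(-1) * A)                     # discretize A
        dBx = dt[:, t].unsqueeze(-1) * B[:, t].unsqueeze(1) * x[:, t].unsqueeze(-1)
        h = dA * h + dBx                                               # recurrent state update
        outs.append((h * C[:, t].unsqueeze(1)).sum(-1))                # read out y_t
    return torch.stack(outs, dim=1)


class VimBlockSketch(nn.Module):
    """Bidirectional SSM block: shared input projection, per-direction conv + scan."""

    def __init__(self, dim, d_state=16, expand=2):
        super().__init__()
        d_inner = expand * dim
        self.d_state = d_state
        self.norm = nn.LayerNorm(dim)
        self.in_proj = nn.Linear(dim, 2 * d_inner)                     # features x and gate z
        self.conv = nn.ModuleDict({d: nn.Conv1d(d_inner, d_inner, 4, padding=3,
                                                groups=d_inner) for d in ("fwd", "bwd")})
        self.x_proj = nn.ModuleDict({d: nn.Linear(d_inner, 2 * d_state + d_inner)
                                     for d in ("fwd", "bwd")})
        self.A = nn.ParameterDict({d: nn.Parameter(-torch.rand(d_inner, d_state))
                                   for d in ("fwd", "bwd")})           # negative A for stability
        self.out_proj = nn.Linear(d_inner, dim)

    def _scan(self, x, direction):
        length = x.shape[1]
        x = self.conv[direction](x.transpose(1, 2))[:, :, :length].transpose(1, 2)
        x = F.silu(x)
        B, C, dt = torch.split(self.x_proj[direction](x),
                               [self.d_state, self.d_state, x.shape[-1]], dim=-1)
        return naive_ssm_scan(x, self.A[direction], B, C, F.softplus(dt))

    def forward(self, tokens):                                         # (batch, len, dim)
        x, z = self.in_proj(self.norm(tokens)).chunk(2, dim=-1)
        y_fwd = self._scan(x, "fwd")                                   # forward-direction scan
        y_bwd = self._scan(x.flip(1), "bwd").flip(1)                   # backward-direction scan
        y = (y_fwd + y_bwd) * F.silu(z)                                # gate and merge directions
        return self.out_proj(y) + tokens                               # residual connection
```

For example, `VimBlockSketch(dim=192)(torch.randn(2, 197, 192))` returns a tensor of the same shape, matching the token-sequence-in, token-sequence-out interface of a standard vision-backbone block.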
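The stated dataset protocol (train on the 1.28M-image training set, report top-1 accuracy on the 50K-image validation set) maps directly onto torchvision's `ImageNet` dataset; the sketch below assumes the ImageNet-1K archives are already downloaded to `./imagenet`.

```python
# Minimal sketch of the ImageNet-1K split protocol described in the table.
from torchvision import datasets

train_set = datasets.ImageNet(root="./imagenet", split="train")  # 1.28M training images
val_set = datasets.ImageNet(root="./imagenet", split="val")      # 50K images; top-1 reported here
```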
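Likewise, the reported experiment setup corresponds to a standard PyTorch recipe. The sketch below wires up the stated hyperparameters (AdamW with momentum 0.9, weight decay 0.05, 1×10⁻³ initial LR, 300-epoch cosine schedule, label smoothing, random erasing, EMA); the label-smoothing value, erasing probability, mixup settings, and EMA decay are assumptions not given in this excerpt.

```python
# Hedged sketch of the reported training recipe. Stated: AdamW, momentum 0.9,
# total batch size 1024 (set in the DataLoader/launcher across 8 GPUs),
# weight decay 0.05, 300 epochs, cosine schedule, 1e-3 LR, EMA.
# ASSUMED: smoothing 0.1, erasing p=0.25, EMA decay 0.9999.
import torch
from torchvision import transforms

# Per-image augmentation; mixup is batch-level (often timm.data.Mixup) and omitted here.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),      # random cropping at 224x224 input
    transforms.RandomHorizontalFlip(),      # random horizontal flipping
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),       # random erasing; p=0.25 is assumed
])

model = torch.nn.Sequential(                # toy stand-in; the paper trains full Vim models
    torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 1000))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.999),  # beta1 = the "momentum of 0.9"
                              weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # smoothing value assumed

# Exponential moving average of weights; the 0.9999 decay is an assumption.
ema = torch.optim.swa_utils.AveragedModel(
    model, avg_fn=lambda avg, new, n: 0.9999 * avg + (1 - 0.9999) * new)
```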