Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Authors: Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, Xinggang Wang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on ImageNet classification and dense prediction downstream tasks. The results demonstrate that Vim achieves superior performance compared to the well-established and highly-optimized plain vision Transformer, i.e., DeiT. |
| Researcher Affiliation | Collaboration | 1School of EIC, Huazhong University of Science & Technology 2Institute of Artificial Intelligence, Huazhong University of Science & Technology 3Horizon Robotics 4Beijing Academy of Artificial Intelligence. |
| Pseudocode | Yes | Specifically, we present the operations of Vim block in Algo. 1. |
| Open Source Code | Yes | Code and models are released at https://github.com/hustvl/Vim |
| Open Datasets | Yes | We benchmark Vim on the ImageNet-1K dataset (Deng et al., 2009), which contains 1.28M training images and 50K validation images from 1,000 categories. |
| Dataset Splits | Yes | We benchmark Vim on the ImageNet-1K dataset (Deng et al., 2009), which contains 1.28M training images and 50K validation images from 1,000 categories. All models are trained on the training set, and top-1 accuracy on the validation set is reported. |
| Hardware Specification | Yes | Experiments are performed on 8 A800 GPUs. |
| Software Dependencies | No | The paper mentions components such as the AdamW optimizer and the UperNet and ViTDet frameworks, but does not provide version numbers for these or other key software dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Specifically, we apply random cropping, random horizontal flipping, label-smoothing regularization, mixup, and random erasing as data augmentations. When training on 224×224 input images, we employ AdamW (Loshchilov & Hutter, 2019) with a momentum of 0.9, a total batch size of 1024, and a weight decay of 0.05 to optimize models. We train the Vim models for 300 epochs using a cosine schedule, 1×10⁻³ initial learning rate, and EMA. |
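The training recipe quoted above (300 epochs, cosine schedule, 1×10⁻³ initial learning rate) can be sketched as a per-epoch learning-rate function. This is a minimal illustration of a standard cosine anneal, assuming decay to zero; the paper does not state a minimum learning rate or warmup details, so those are assumptions here.

```python
import math

def cosine_lr(epoch, total_epochs=300, base_lr=1e-3, min_lr=0.0):
    """Cosine-annealed learning rate for a 0-indexed epoch.

    base_lr and total_epochs follow the paper's quoted setup;
    min_lr=0.0 is an assumption (not specified in the paper).
    """
    progress = epoch / max(total_epochs - 1, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# The schedule starts at the full 1e-3 rate and decays smoothly to min_lr.
print(cosine_lr(0))    # 1e-3 at the first epoch
print(cosine_lr(299))  # min_lr at the final epoch
```

In practice this shape is what PyTorch's `CosineAnnealingLR` or timm's scheduler would produce for the same hyperparameters, optionally preceded by a warmup phase.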