Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

Authors: Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, Xinggang Wang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct extensive experiments on ImageNet classification and dense prediction downstream tasks. The results demonstrate that Vim achieves superior performance compared to the well-established and highly-optimized plain vision Transformer, i.e., DeiT."
Researcher Affiliation | Collaboration | 1) School of EIC, Huazhong University of Science & Technology; 2) Institute of Artificial Intelligence, Huazhong University of Science & Technology; 3) Horizon Robotics; 4) Beijing Academy of Artificial Intelligence.
Pseudocode | Yes | "Specifically, we present the operations of Vim block in Algo. 1." (A hedged sketch of this block appears below the table.)
Open Source Code | Yes | "Code and models are released at https://github.com/hustvl/Vim."
Open Datasets | Yes | "We benchmark Vim on the ImageNet-1K dataset (Deng et al., 2009), which contains 1.28M training images and 50K validation images from 1,000 categories."
Dataset Splits | Yes | "We benchmark Vim on the ImageNet-1K dataset (Deng et al., 2009), which contains 1.28M training images and 50K validation images from 1,000 categories. All models are trained on the training set, and top-1 accuracy on the validation set is reported." (A loading sketch appears below the table.)
Hardware Specification | Yes | "Experiments are performed on 8 A800 GPUs."
Software Dependencies | No | The paper mentions software such as the AdamW optimizer and the UperNet and ViTDet frameworks, but does not provide version numbers for these or for other key dependencies (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | "Specifically, we apply random cropping, random horizontal flipping, label-smoothing regularization, mixup, and random erasing as data augmentations. When training on 224² input images, we employ AdamW (Loshchilov & Hutter, 2019) with a momentum of 0.9, a total batch size of 1024, and a weight decay of 0.05 to optimize models. We train the Vim models for 300 epochs using a cosine schedule, a 1×10⁻³ initial learning rate, and EMA." (A recipe sketch appears below the table.)
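
To make the pseudocode entry concrete, below is a minimal PyTorch sketch of a Vim-style bidirectional SSM block in the spirit of Algo. 1. It is not the released implementation: the selective scan is written as a naive Python loop (the official code uses a hardware-aware kernel), and the module names, state size, and initialization here are illustrative assumptions. See https://github.com/hustvl/Vim for the authors' code.

```python
# Illustrative sketch of a bidirectional SSM ("Vim-style") block.
# NOT the official implementation; names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def naive_ssm_scan(x, A, B, C, dt):
    """Sequential SSM scan: h_t = exp(dt_t*A)*h_{t-1} + dt_t*B_t*x_t; y_t = <C_t, h_t>.

    Shapes: x (batch, len, dim); A (dim, state); B, C (batch, len, state); dt (batch, len, dim).
    """
    bsz, length, dim = x.shape
    h = x.new_zeros(bsz, dim, A.shape[-1])
    outs = []
    for t in range(length):
        dA = torch.exp(dt[:, t].unsqueeze(-1) * A)                     # discretize A
        dBx = dt[:, t].unsqueeze(-1) * B[:, t].unsqueeze(1) * x[:, t].unsqueeze(-1)
        h = dA * h + dBx                                               # recurrent state update
        outs.append((h * C[:, t].unsqueeze(1)).sum(-1))                # read out y_t
    return torch.stack(outs, dim=1)


class VimBlockSketch(nn.Module):
    """Bidirectional SSM block: shared input projection, per-direction conv + scan."""

    def __init__(self, dim, d_state=16, expand=2):
        super().__init__()
        d_inner = expand * dim
        self.d_state = d_state
        self.norm = nn.LayerNorm(dim)
        self.in_proj = nn.Linear(dim, 2 * d_inner)                     # features x and gate z
        self.conv = nn.ModuleDict({d: nn.Conv1d(d_inner, d_inner, 4, padding=3,
                                                groups=d_inner) for d in ("fwd", "bwd")})
        self.x_proj = nn.ModuleDict({d: nn.Linear(d_inner, 2 * d_state + d_inner)
                                     for d in ("fwd", "bwd")})
        self.A = nn.ParameterDict({d: nn.Parameter(-torch.rand(d_inner, d_state))
                                   for d in ("fwd", "bwd")})           # negative A for stability
        self.out_proj = nn.Linear(d_inner, dim)

    def _scan(self, x, direction):
        length = x.shape[1]
        x = self.conv[direction](x.transpose(1, 2))[:, :, :length].transpose(1, 2)
        x = F.silu(x)
        B, C, dt = torch.split(self.x_proj[direction](x),
                               [self.d_state, self.d_state, x.shape[-1]], dim=-1)
        return naive_ssm_scan(x, self.A[direction], B, C, F.softplus(dt))

    def forward(self, tokens):                                         # (batch, len, dim)
        x, z = self.in_proj(self.norm(tokens)).chunk(2, dim=-1)
        y_fwd = self._scan(x, "fwd")                                   # forward-direction scan
        y_bwd = self._scan(x.flip(1), "bwd").flip(1)                   # backward-direction scan
        y = (y_fwd + y_bwd) * F.silu(z)                                # gate and merge directions
        return self.out_proj(y) + tokens                               # residual connection
```

For example, `VimBlockSketch(dim=192)(torch.randn(2, 197, 192))` returns a tensor of the same shape, matching the token-sequence-in, token-sequence-out interface of a standard vision-backbone block.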
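The stated dataset protocol (train on the 1.28M-image training set, report top-1 accuracy on the 50K-image validation set) maps directly onto torchvision's `ImageNet` dataset; the sketch below assumes the ImageNet-1K archives are already downloaded to `./imagenet`.

```python
# Minimal sketch of the ImageNet-1K split protocol described in the table.
from torchvision import datasets

train_set = datasets.ImageNet(root="./imagenet", split="train")  # 1.28M training images
val_set = datasets.ImageNet(root="./imagenet", split="val")      # 50K images; top-1 reported here
```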
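Likewise, the reported experiment setup corresponds to a standard PyTorch recipe. The sketch below wires up the stated hyperparameters (AdamW with momentum 0.9, weight decay 0.05, 1×10⁻³ initial LR, 300-epoch cosine schedule, label smoothing, random erasing, EMA); the label-smoothing value, erasing probability, mixup settings, and EMA decay are assumptions not given in this excerpt.

```python
# Hedged sketch of the reported training recipe. Stated: AdamW, momentum 0.9,
# total batch size 1024 (set in the DataLoader/launcher across 8 GPUs),
# weight decay 0.05, 300 epochs, cosine schedule, 1e-3 LR, EMA.
# ASSUMED: smoothing 0.1, erasing p=0.25, EMA decay 0.9999.
import torch
from torchvision import transforms

# Per-image augmentation; mixup is batch-level (often timm.data.Mixup) and omitted here.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),      # random cropping at 224x224 input
    transforms.RandomHorizontalFlip(),      # random horizontal flipping
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),       # random erasing; p=0.25 is assumed
])

model = torch.nn.Sequential(                # toy stand-in; the paper trains full Vim models
    torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 1000))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.999),  # beta1 = the "momentum of 0.9"
                              weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # smoothing value assumed

# Exponential moving average of weights; the 0.9999 decay is an assumption.
ema = torch.optim.swa_utils.AveragedModel(
    model, avg_fn=lambda avg, new, n: 0.9999 * avg + (1 - 0.9999) * new)
```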