Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model

Authors: Yuheng Shi, Minjing Dong, Chang Xu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that MSVMamba is highly competitive, with the MSVMamba-Tiny model achieving 83.0% top-1 accuracy on ImageNet, 46.9% box mAP and 42.5% instance mAP with the Mask R-CNN framework and 1× training schedule on COCO, and 47.9% mIoU with single-scale testing on ADE20K.
Researcher Affiliation | Academia | Yuheng Shi, City University of Hong Kong (yuhengshi99@gmail.com); Minjing Dong, City University of Hong Kong (minjdong@cityu.edu.hk); Chang Xu, University of Sydney (c.xu@sydney.edu.au)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/YuHengsss/MSVMamba.
Open Datasets | Yes | Our models are trained and tested on the ImageNet-1K dataset [3]. We evaluate our MSVMamba on the MSCOCO [28] dataset using the Mask R-CNN [17] framework, and perform semantic segmentation on the ADE20K dataset [58] using the UperNet framework [53].
Dataset Splits | No | The paper mentions training on datasets like ImageNet-1K, MSCOCO, and ADE20K but does not provide specific train/validation/test split percentages, sample counts, or a detailed splitting methodology for reproduction. It mentions the 'ImageNet validation set' in Fig. 2, but this does not constitute explicit split information for reproduction.
Hardware Specification | Yes | The latency was tested on an RTX 4090 GPU with a batch size of 128 using FP32 precision at an image resolution of 224. The training utilizes a batch size of 1024 across 8 GPUs. FPS and memory are tested on a 4090 GPU with a batch size of 128 and FP32 precision. (A timing sketch follows the table.)
Software Dependencies | No | The paper mentions software components like the AdamW optimizer, Mask R-CNN framework, and UperNet framework, but does not provide specific version numbers for any key software dependencies or libraries.
Experiment Setup | Yes | Our models are trained and tested on the ImageNet-1K dataset [3]. In alignment with previous works [32, 30, 23], all models undergo training for 300 epochs, with the initial 20 epochs dedicated to warming up. The training utilizes a batch size of 1024 across 8 GPUs. We employ the AdamW optimizer, setting the betas to (0.9, 0.999) and momentum to 0.9. The learning rate is managed through a cosine decay scheduler, starting from an initial rate of 0.001, coupled with a weight decay of 0.05. Additionally, we leverage the exponential moving average (EMA) and implement label smoothing with a factor of 0.1 to enhance model performance and generalization. During testing, images are center cropped to a size of 224×224. For COCO, we employ standard training strategies of 1× (12 epochs) and 3× (36 epochs) with Multi-Scale (MS) training for a fair comparison. For ADE20K, the training process is conducted over 160K iterations with a batch size of 16, using the AdamW optimizer with a learning rate set at 6×10⁻⁵; our experiments are primarily conducted using a default input resolution of 512×512. (A configuration sketch follows the table.)
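
The hardware entry above describes how latency, FPS, and memory were measured (RTX 4090, batch size 128, FP32, 224-pixel inputs). A minimal sketch of such a measurement in PyTorch follows; the placeholder network, warm-up count, and iteration count are our assumptions, not details from the paper:

```python
import time
import torch

# Placeholder network standing in for MSVMamba-Tiny; swap in the model from
# https://github.com/YuHengsss/MSVMamba (the exact class name is an assumption).
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 96, kernel_size=4, stride=4),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(96, 1000),
).cuda().eval()

x = torch.randn(128, 3, 224, 224, device="cuda")  # batch 128, FP32, 224x224

with torch.no_grad():
    for _ in range(10):          # warm-up passes (count is our choice)
        model(x)
    torch.cuda.synchronize()     # wait for queued kernels before timing
    start = time.time()
    for _ in range(50):
        model(x)
    torch.cuda.synchronize()
    elapsed = time.time() - start

print(f"latency: {elapsed / 50 * 1e3:.2f} ms/batch, "
      f"throughput: {128 * 50 / elapsed:.0f} img/s, "
      f"peak memory: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
```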
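
The classification recipe in the experiment-setup entry (AdamW with betas (0.9, 0.999), cosine decay from an initial learning rate of 0.001, weight decay 0.05, 300 epochs with 20 warm-up epochs, label smoothing 0.1, and EMA) maps directly onto standard PyTorch components. The sketch below assembles them; the placeholder model, the linear warm-up shape, and the EMA decay value are assumptions rather than details from the paper:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
from torch.optim.swa_utils import AveragedModel, get_ema_multi_avg_fn

EPOCHS, WARMUP_EPOCHS = 300, 20

# Placeholder for the MSVMamba backbone.
model = torch.nn.Linear(768, 1000)

# AdamW with betas (0.9, 0.999) and weight decay 0.05, per the paper.
optimizer = AdamW(model.parameters(), lr=1e-3,
                  betas=(0.9, 0.999), weight_decay=0.05)

# 20 warm-up epochs, then cosine decay over the remaining 280 epochs.
# (A linear warm-up is assumed; the paper does not specify its shape.)
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=WARMUP_EPOCHS),
        CosineAnnealingLR(optimizer, T_max=EPOCHS - WARMUP_EPOCHS),
    ],
    milestones=[WARMUP_EPOCHS],
)

# Label smoothing with a factor of 0.1.
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

# EMA of the weights; the 0.9999 decay is an assumption, not from the paper.
ema_model = AveragedModel(model, multi_avg_fn=get_ema_multi_avg_fn(0.9999))
```

For the ADE20K setup quoted in the same entry, this epoch-based schedule would instead run for 160K iterations with a batch size of 16 and an AdamW learning rate of 6×10⁻⁵, as is typical of UperNet configurations.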