Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model
Authors: Yuheng Shi, Minjing Dong, Chang Xu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that MSVMamba is highly competitive, with the MSVMamba-Tiny model achieving 83.0% top-1 accuracy on ImageNet, 46.9% box mAP, and 42.5% instance mAP with the Mask R-CNN framework, 1× training schedule on COCO, and 47.9% mIoU with single-scale testing on ADE20K. |
| Researcher Affiliation | Academia | Yuheng Shi (City University of Hong Kong, yuhengshi99@gmail.com); Minjing Dong (City University of Hong Kong, minjdong@cityu.edu.hk); Chang Xu (University of Sydney, c.xu@sydney.edu.au) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/YuHengsss/MSVMamba. |
| Open Datasets | Yes | Our models are trained and tested on the ImageNet-1K dataset [3]. We evaluate our MSVMamba on the MSCOCO [28] dataset using the Mask R-CNN [17] framework, and on semantic segmentation on the ADE20K dataset [58] using the UperNet framework [53]. |
| Dataset Splits | No | The paper mentions training on datasets like ImageNet-1K, MSCOCO, and ADE20K but does not provide specific train/validation/test split percentages, sample counts, or a detailed splitting methodology for reproduction. It mentions the 'ImageNet validation set' in Fig 2, but this does not constitute explicit split information for reproduction. |
| Hardware Specification | Yes | The latency was tested on an RTX 4090 GPU with a batch size of 128 using FP32 precision at an image resolution of 224. The training utilizes a batch size of 1024 across 8 GPUs. FPS and memory are tested on a 4090 GPU with a batch size of 128 and FP32 precision. |
| Software Dependencies | No | The paper mentions software components like the AdamW optimizer, Mask R-CNN framework, and UperNet framework, but does not provide specific version numbers for any key software dependencies or libraries. |
| Experiment Setup | Yes | Our models are trained and tested on the ImageNet-1K dataset [3]. In alignment with previous works [32, 30, 23], all models undergo training for 300 epochs, with the initial 20 epochs dedicated to warming up. The training utilizes a batch size of 1024 across 8 GPUs. We employ the AdamW optimizer, setting the betas to (0.9, 0.999) and momentum to 0.9. The learning rate is managed through a cosine decay scheduler, starting from an initial rate of 0.001, coupled with a weight decay of 0.05. Additionally, we leverage the exponential moving average (EMA) and implement label smoothing with a factor of 0.1 to enhance model performance and generalization. During testing, images are center cropped with the size of 224×224. We employ standard training strategies of 1× (12 epochs) and 3× (36 epochs) with Multi-Scale (MS) training for a fair comparison. The training process is conducted over 160K iterations with a batch size of 16. We employ the AdamW optimizer with a learning rate set at 6×10⁻⁵. Our experiments are primarily conducted using a default input resolution of 512×512. |
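The ImageNet-1K schedule quoted above (300 epochs, 20 warmup epochs, cosine decay from an initial rate of 0.001) can be sketched as a small learning-rate function. This is a minimal illustration, not the authors' code: the paper states only the scheduler type and initial rate, so the linear warmup shape and the decay to a zero floor are assumptions.

```python
import math

def lr_at_epoch(epoch, total_epochs=300, warmup_epochs=20,
                base_lr=1e-3, min_lr=0.0):
    """Per-epoch learning rate: linear warmup, then cosine decay.

    Hypothetical helper matching the reported recipe; warmup shape
    and min_lr=0 floor are assumptions not stated in the paper.
    """
    if epoch < warmup_epochs:
        # linear warmup from base_lr/warmup_epochs up to base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    # cosine decay over the remaining 280 epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at_epoch(19))   # last warmup epoch reaches the base rate, 0.001
print(lr_at_epoch(299))  # final epoch: rate has decayed to near zero
```

In practice this shape corresponds to pairing `torch.optim.AdamW` (betas=(0.9, 0.999), weight_decay=0.05) with a warmup-plus-cosine scheduler, as is standard in the timm-style recipes the paper aligns with.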