VMamba: Visual State Space Model
Authors: Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, Yunfan Liu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate VMamba's promising performance across diverse visual perception tasks, highlighting its superior input scaling efficiency compared to existing benchmark models. |
| Researcher Affiliation | Collaboration | Yue Liu¹, Yunjie Tian¹, Yuzhong Zhao¹, Hongtian Yu¹, Lingxi Xie², Yaowei Wang³, Qixiang Ye¹, Jianbin Jiao¹, Yunfan Liu¹ (¹UCAS, ²Huawei Inc., ³Pengcheng Lab.) |
| Pseudocode | No | The paper describes computational processes and network architectures with text and diagrams (e.g., Figure 2 and Figure 3 illustrate SS2D and network architecture) but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code is available at https://github.com/MzeroMiko/VMamba |
| Open Datasets | Yes | VMamba consistently achieves higher image classification accuracy on ImageNet-1K [9] across various model scales. [...] VMamba's superiority extends across multiple downstream tasks, with VMamba-Tiny/Small/Base achieving 47.3%/48.7%/49.2% mAP in object detection on COCO [33] (1× training schedule). For single-scale semantic segmentation on ADE20K [68], VMamba-Tiny/Small/Base achieves 47.9%/50.6%/51.0% mIoU. |
| Dataset Splits | Yes | We evaluate VMamba's performance in image classification on ImageNet-1K [9], with comparison results against benchmark methods summarized in Table 1. [...] For object detection and instance segmentation, we adhere to the protocol outlined by Swin [36] and construct our models using the mmdetection framework [3]. [...] For semantic segmentation, we follow Swin [36] and construct a UperHead [63] network on top of the pre-trained model using the MMSegmentation library [4]. |
| Hardware Specification | Yes | All experiments were conducted on a server with 8 NVIDIA Tesla-A100 GPUs. [...] Throughput values are measured with an A100 GPU and an AMD EPYC 7542 CPU, using the toolkit released by [62], following the protocol proposed in [36]. A minimal single-GPU timing sketch is given after the table. |
| Software Dependencies | No | The paper mentions the use of the MMDetection [3] and MMSegmentation [4] libraries, and refers to 'torch.nn.functional.linear' for implementation. However, it does not specify exact version numbers for these software dependencies (e.g., the PyTorch or MMDetection version), which would be necessary for full reproducibility. A version-reporting sketch is given after the table. |
| Experiment Setup | Yes | Specifically, VMamba-T/S/B models are trained from scratch for 300 epochs, with a 20-epoch warm-up period, using a batch size of 1024. The training process utilizes the AdamW optimizer [38] with betas set to (0.9, 0.999), an initial learning rate of 1×10⁻³, a weight decay of 0.05, and a cosine decay learning rate scheduler. It is noteworthy that this is not the optimal setting for VMamba: with a learning rate of 2×10⁻³, the Top-1 accuracy of VMamba-T can reach 80.7%. Additional techniques such as label smoothing (0.1) and EMA (decay ratio of 0.9999) are also applied. The drop_path_ratio is set to 0.2 for Vanilla-VMamba-T and VMamba-T, 0.3 for Vanilla-VMamba-S, VMamba-S[s2l15] and VMamba-S[s1l20], 0.6 for Vanilla-VMamba-B and VMamba-B[s2l15], and 0.5 for VMamba-B[s1l20]. An optimizer/scheduler sketch based on these settings is given after the table. |
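
The throughput figures in the Hardware Specification row rely on the toolkit of [62] and the protocol of [36], neither of which is reproduced here. The sketch below is only a generic single-GPU timing loop in PyTorch; the model (`torchvision.models.resnet50`), batch size, image size, and iteration counts are placeholders, not the paper's settings.

```python
import time
import torch
import torchvision

# Generic throughput timing sketch (illustrative only; not the toolkit of [62]).
def measure_throughput(model, batch_size=128, image_size=224, warmup=10, iters=50):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    images = torch.randn(batch_size, 3, image_size, image_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):              # warm-up passes, excluded from timing
            model(images)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(images)
        if device == "cuda":
            torch.cuda.synchronize()
    return iters * batch_size / (time.time() - start)   # images per second

if __name__ == "__main__":
    # resnet50 stands in for VMamba, which is not reconstructed here.
    print(f"{measure_throughput(torchvision.models.resnet50()):.1f} img/s")
```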
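
Because the Software Dependencies row notes that no versions are pinned, one way to make a rerun reproducible is to record the installed versions alongside the results. A minimal sketch, assuming the PyPI package names `mmdet` and `mmsegmentation` (the paper only names the libraries, not the packages):

```python
from importlib import metadata  # standard library, Python 3.8+
import torch

# Report the versions of the libraries the paper mentions but does not pin.
print("torch", torch.__version__)
for name in ("mmdet", "mmsegmentation"):     # assumed PyPI package names
    try:
        print(name, metadata.version(name))
    except metadata.PackageNotFoundError:
        print(name, "not installed")
```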
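
The optimization settings quoted in the Experiment Setup row map onto standard PyTorch components. The sketch below mirrors the reported values (AdamW with betas (0.9, 0.999), lr 1×10⁻³, weight decay 0.05, cosine decay, label smoothing 0.1, EMA decay 0.9999); it is not the authors' training code, and the 20-epoch warm-up, batch size of 1024, and per-model drop_path ratios are omitted.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR


def build_training_objects(model, epochs=300, lr=1e-3):
    """Assemble optimizer, scheduler, loss, and EMA with the reported settings."""
    optimizer = AdamW(model.parameters(), lr=lr, betas=(0.9, 0.999), weight_decay=0.05)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)       # cosine decay over 300 epochs
    criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)   # label smoothing 0.1
    ema = torch.optim.swa_utils.AveragedModel(                   # EMA, decay ratio 0.9999
        model, avg_fn=lambda avg, new, num: 0.9999 * avg + (1 - 0.9999) * new)
    return optimizer, scheduler, criterion, ema
```

In training, `scheduler.step()` would be called once per epoch and `ema.update_parameters(model)` after each optimizer step; the warm-up would typically be handled by a separate scheduler chained before the cosine decay.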