Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model

Authors: Yuheng Shi, Minjing Dong, Chang Xu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that MSVMamba is highly competitive, with the MSVMamba-Tiny model achieving 83.0% top-1 accuracy on ImageNet, 46.9% box mAP and 42.5% instance mAP with the Mask R-CNN framework and 1× training schedule on COCO, and 47.9% mIoU with single-scale testing on ADE20K.
Researcher Affiliation | Academia | Yuheng Shi, City University of Hong Kong (yuhengshi99@gmail.com); Minjing Dong, City University of Hong Kong (minjdong@cityu.edu.hk); Chang Xu, University of Sydney (c.xu@sydney.edu.au)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/YuHengsss/MSVMamba.
Open Datasets | Yes | Our models are trained and tested on the ImageNet-1K dataset [3]. We evaluate our MSVMamba on the MSCOCO [28] dataset using the Mask R-CNN [17] framework, and perform semantic segmentation on the ADE20K dataset [58] using the UperNet framework [53].
Dataset Splits | No | The paper mentions training on datasets like ImageNet-1K, MSCOCO, and ADE20K but does not provide specific train/validation/test split percentages, sample counts, or a detailed splitting methodology for reproduction. It mentions the 'ImageNet validation set' in Fig. 2, but this does not constitute explicit split information for reproduction.
Hardware Specification | Yes | The latency was tested on an RTX 4090 GPU with a batch size of 128 using FP32 precision at an image resolution of 224. The training utilizes a batch size of 1024 across 8 GPUs. FPS and memory are tested on a 4090 GPU with a batch size of 128 and FP32 precision. (A timing sketch follows the table.)
Software Dependencies | No | The paper mentions software components like the AdamW optimizer, Mask R-CNN framework, and UperNet framework, but does not provide specific version numbers for any key software dependencies or libraries.
Experiment Setup | Yes | Our models are trained and tested on the ImageNet-1K dataset [3]. In alignment with previous works [32, 30, 23], all models undergo training for 300 epochs, with the initial 20 epochs dedicated to warming up. The training utilizes a batch size of 1024 across 8 GPUs. We employ the AdamW optimizer, setting the betas to (0.9, 0.999) and momentum to 0.9. The learning rate is managed through a cosine decay scheduler, starting from an initial rate of 0.001, coupled with a weight decay of 0.05. Additionally, we leverage the exponential moving average (EMA) and implement label smoothing with a factor of 0.1 to enhance model performance and generalization. During testing, images are center cropped to a size of 224×224. For COCO, we employ standard training strategies of 1× (12 epochs) and 3× (36 epochs) with Multi-Scale (MS) training for a fair comparison. For ADE20K, the training process is conducted over 160K iterations with a batch size of 16, using the AdamW optimizer with a learning rate set at 6×10⁻⁵; our experiments are primarily conducted using a default input resolution of 512×512. (A configuration sketch follows the table.)
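
The hardware entry above describes how latency, FPS, and memory were measured (RTX 4090, batch size 128, FP32, 224-pixel inputs). A minimal sketch of such a measurement in PyTorch follows; the placeholder network, warm-up count, and iteration count are our assumptions, not details from the paper:

```python
import time
import torch

# Placeholder network standing in for MSVMamba-Tiny; swap in the model from
# https://github.com/YuHengsss/MSVMamba (the exact class name is an assumption).
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 96, kernel_size=4, stride=4),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(96, 1000),
).cuda().eval()

x = torch.randn(128, 3, 224, 224, device="cuda")  # batch 128, FP32, 224x224

with torch.no_grad():
    for _ in range(10):          # warm-up passes (count is our choice)
        model(x)
    torch.cuda.synchronize()     # wait for queued kernels before timing
    start = time.time()
    for _ in range(50):
        model(x)
    torch.cuda.synchronize()
    elapsed = time.time() - start

print(f"latency: {elapsed / 50 * 1e3:.2f} ms/batch, "
      f"throughput: {128 * 50 / elapsed:.0f} img/s, "
      f"peak memory: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
```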
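
The classification recipe in the experiment-setup entry (AdamW with betas (0.9, 0.999), cosine decay from an initial learning rate of 0.001, weight decay 0.05, 300 epochs with 20 warm-up epochs, label smoothing 0.1, and EMA) maps directly onto standard PyTorch components. The sketch below assembles them; the placeholder model, the linear warm-up shape, and the EMA decay value are assumptions rather than details from the paper:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
from torch.optim.swa_utils import AveragedModel, get_ema_multi_avg_fn

EPOCHS, WARMUP_EPOCHS = 300, 20

# Placeholder for the MSVMamba backbone.
model = torch.nn.Linear(768, 1000)

# AdamW with betas (0.9, 0.999) and weight decay 0.05, per the paper.
optimizer = AdamW(model.parameters(), lr=1e-3,
                  betas=(0.9, 0.999), weight_decay=0.05)

# 20 warm-up epochs, then cosine decay over the remaining 280 epochs.
# (A linear warm-up is assumed; the paper does not specify its shape.)
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=WARMUP_EPOCHS),
        CosineAnnealingLR(optimizer, T_max=EPOCHS - WARMUP_EPOCHS),
    ],
    milestones=[WARMUP_EPOCHS],
)

# Label smoothing with a factor of 0.1.
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

# EMA of the weights; the 0.9999 decay is an assumption, not from the paper.
ema_model = AveragedModel(model, multi_avg_fn=get_ema_multi_avg_fn(0.9999))
```

For the ADE20K setup quoted in the same entry, this epoch-based schedule would instead run for 160K iterations with a batch size of 16 and an AdamW learning rate of 6×10⁻⁵, as is typical of UperNet configurations.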