QuadMamba: Learning Quadtree-based Selective Scan for Visual State Space Model

Authors: Fei Xie, Weijia Zhang, Zhongdao Wang, Chao Ma

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments demonstrate that QuadMamba achieves state-of-the-art performance in various vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation." and "We conduct experiments on commonly used benchmarks, including ImageNet-1K [29] for image classification, MS COCO 2017 [37] for object detection and instance segmentation, and ADE20K [78] for semantic segmentation."
Researcher Affiliation | Collaboration | Fei Xie (1), Weijia Zhang (1), Zhongdao Wang (2), Chao Ma (1); (1) MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University; (2) Huawei Noah's Ark Lab
Pseudocode | Yes | "In Sec. A.4, we also provide the pseudo-code to help understand the key operations within the Quad VSS block." and "Algorithm 1: PyTorch code of Quad VSS block; Algorithm 2: PyTorch code of Quadtree window partition at two levels; Algorithm 3: PyTorch code of Quadtree window restoration at two levels; Algorithm 4: PyTorch code of differentiable sequence masking."
Open Source Code | Yes | "The code is in https://github.com/VISION-SJTU/QuadMamba."
Open Datasets | Yes | "We conduct experiments on commonly used benchmarks, including ImageNet-1K [29] for image classification, MS COCO 2017 [37] for object detection and instance segmentation, and ADE20K [78] for semantic segmentation."
Dataset Splits | Yes | "ImageNet [29] is widely recognized as the standard for image classification benchmarks, consisting of around 1.3 million training images and 50,000 validation images spread across 1,000 classes."
Hardware Specification | Yes | "Our models are implemented with PyTorch and Timm libraries and trained on A800 GPUs." and "Measurements are taken with an A800 GPU."
Software Dependencies | No | "Our models are implemented with PyTorch and Timm libraries and trained on A800 GPUs." (No version numbers are specified for PyTorch or Timm.)
Experiment Setup | Yes | "The data augmentation techniques used include random resized crop (input image size of 224x224), horizontal flip, RandAugment [77], Mixup [70], CutMix [69], Random Erasing [77], and color jitter. Additionally, regularization techniques such as weight decay, stochastic depth [24], and label smoothing [56] are applied. All models are trained using AdamW [45]. The learning rate scaling rule is calculated as Batch Size / 1024 x 10^-3." and "The learning rate is set as 6 x 10^-5. The fine-tuning process consists of a total of 160,000 iterations with a batch size of 16."
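The Experiment Setup entry quotes a linear learning-rate scaling rule, Batch Size / 1024 x 10^-3. A minimal sketch of that arithmetic follows; the function name and signature are illustrative, not taken from the paper's code:

```python
def scaled_lr(batch_size, base_lr=1e-3, base_batch=1024):
    """Linear LR scaling rule as quoted: lr = batch_size / 1024 * 1e-3.

    Hypothetical helper for illustration only.
    """
    return batch_size / base_batch * base_lr


# With the reference batch size of 1024, the rule yields the base rate 1e-3;
# halving the batch size to 512 halves the learning rate to 5e-4.
print(scaled_lr(1024))
print(scaled_lr(512))
```

Under this rule, the batch size of 16 quoted for fine-tuning would correspond to a much smaller rate, which is consistent with the separately stated fine-tuning learning rate of 6 x 10^-5 being set independently.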
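The Pseudocode entry lists Algorithm 2, "Quadtree window partition at two levels". As a rough illustration of what such a partition does (not the authors' implementation, which is given in PyTorch in their appendix), here is a NumPy sketch under the assumption that partitioning means splitting a feature map into 2x2 quadrants and then splitting each quadrant again:

```python
import numpy as np

def quadtree_partition_two_levels(x):
    """Partition a square feature map into quadrants at two levels.

    x: array of shape (B, H, W, C) with H and W divisible by 4.
    Returns:
      coarse: (B, 4, H//2, W//2, C)  -- the four level-1 quadrants
      fine:   (B, 16, H//4, W//4, C) -- the sixteen level-2 sub-quadrants
    Quadrants are ordered row-major: top-left, top-right,
    bottom-left, bottom-right.
    """
    B, H, W, C = x.shape
    # Level 1: view as (B, 2, H/2, 2, W/2, C), bring the two grid axes
    # together, and flatten them into a single quadrant axis.
    coarse = (x.reshape(B, 2, H // 2, 2, W // 2, C)
               .transpose(0, 1, 3, 2, 4, 5)
               .reshape(B, 4, H // 2, W // 2, C))
    # Level 2: apply the same 2x2 split inside every level-1 quadrant.
    fine = (coarse.reshape(B, 4, 2, H // 4, 2, W // 4, C)
                  .transpose(0, 1, 2, 4, 3, 5, 6)
                  .reshape(B, 16, H // 4, W // 4, C))
    return coarse, fine
```

For a 4x4 single-channel map numbered 0..15 row by row, the first coarse quadrant is the top-left 2x2 block [[0, 1], [4, 5]], and the sixteen fine windows are the individual pixels in quadrant-then-sub-quadrant order. The paper's Algorithms 3 and 4 (window restoration and differentiable sequence masking) would invert and gate this layout, respectively.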