MambaTree: Tree Topology is All You Need in State Space Model

Authors: Yicheng Xiao, Lin Song, Shaoli Huang, Jiangshan Wang, Siyu Song, Yixiao Ge, Xiu Li, Ying Shan

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our method significantly outperforms existing structured state space models on image classification, object detection and segmentation. Besides, by fine-tuning large language models, our approach achieves consistent improvements in multiple textual tasks at minor training cost.
Researcher Affiliation | Collaboration | 1) Tsinghua Shenzhen International Graduate School, Tsinghua University; 2) ARC Lab, Tencent PCG; 3) Tencent AI Lab; 4) South China Normal University
Pseudocode | Yes | Algorithm 1: Vision Tree Scanning (a minimal sketch of the tree-scanning idea appears after the table)
Open Source Code | Yes | Code is available at https://github.com/EasonXiao-888/GrootVL.
Open Datasets | Yes | We assess the classification performance of MambaTreeV on the ImageNet-1k dataset [12]. Following previous practices [43, 44, 62, 41], all MambaTreeV models are trained for 300 epochs from scratch using the AdamW optimizer with a warm-up strategy of 20 epochs. We verify the detection performance of MambaTreeV on the MSCOCO 2017 dataset [39]. To evaluate the semantic segmentation performance of our MambaTreeV series, we train our models with UperNet [65] initialized by pre-trained classification weights on ADE20K [75]. We regard Mamba [19] with 130M parameters as the base model. ... we first fine-tune pre-trained Mamba via LoRA [33] and MambaTreeL under the same setting with the Alpaca data [58], which contains 52,000 instruction-tuning samples for supervised fine-tuning. (A LoRA fine-tuning sketch appears after the table.)
Dataset Splits | Yes | The comparison results summarized in Table 1 show MambaTreeV leading all SSM-based models and competitive with advanced CNNs and Transformers across tiny, small, and base scales. Specifically, MambaTreeV-T achieves 83.4% Top-1 Acc., boosting Vim-S by 2.9%, LocalVim-S by 2.2%, PlainMamba-L2 by 1.8% and VMamba-T by 0.9% with similar FLOPs. Additionally, it surpasses ConvNeXt-T by 1.3% and Swin-T by 2.2%, demonstrating the effectiveness of our method. We assess the classification performance of MambaTreeV on the ImageNet-1k dataset [12].
Hardware Specification | Yes | As shown in Table 7, we report the inference throughputs of our method on an Nvidia V100 GPU. ... The models are trained with thirty-two 32GB V100 GPUs by default. ... The models are trained with eight 32GB V100 GPUs by default.
Software Dependencies | No | The paper mentions the 'AdamW optimizer', 'MMDetection library', 'UperNet', and the 'lm-evaluation-harness' project, but does not specify their version numbers.
Experiment Setup | Yes | All MambaTreeV models are trained for 300 epochs from scratch using the AdamW optimizer with a warm-up strategy of 20 epochs. During training, we utilize a cosine scheduler with an initial learning rate of 1×10⁻³ and weight decay of 0.05. In addition, the exponential moving average (EMA) is also applied. We adopt the AdamW optimizer with a learning rate of 1×10⁻⁴ and batch size of 16. The training schedules include 1× (12 epochs) and 3× (36 epochs) with multi-scale data augmentation. (A sketch of the classification recipe appears below.)
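
The pseudocode row above only names Algorithm 1. For intuition, here is a minimal sketch of the tree-scanning idea: build a minimum spanning tree over the 4-connected feature grid, weighted by feature dissimilarity, and derive a root-to-leaf visiting order. The function name, the L2 dissimilarity metric, and the SciPy routines are illustrative assumptions; the authors' Algorithm 1 additionally performs state-space feature propagation along the tree, which this sketch does not implement.

```python
# A minimal sketch of tree scanning, NOT the authors' Algorithm 1:
# build an MST over a 4-connected pixel grid with feature-dissimilarity
# weights, then take a BFS order as the scan order.
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, breadth_first_order

def tree_scan_order(feat: np.ndarray):
    """feat: (H, W, C) feature map -> (visit order, parent of each node)."""
    H, W, C = feat.shape
    flat = feat.reshape(-1, C)
    idx = np.arange(H * W).reshape(H, W)
    rows, cols, weights = [], [], []
    # Horizontal and vertical neighbor pairs of the 4-connected grid.
    for a, b in ((idx[:, :-1], idx[:, 1:]), (idx[:-1, :], idx[1:, :])):
        a, b = a.ravel(), b.ravel()
        rows.append(a)
        cols.append(b)
        # L2 feature dissimilarity as edge weight (assumed metric); the
        # epsilon keeps zero-weight edges from vanishing in sparse storage.
        weights.append(np.linalg.norm(flat[a] - flat[b], axis=1) + 1e-6)
    graph = coo_matrix(
        (np.concatenate(weights), (np.concatenate(rows), np.concatenate(cols))),
        shape=(H * W, H * W),
    )
    mst = minimum_spanning_tree(graph)  # pruned, feature-adaptive topology
    # BFS from an arbitrary root yields a root-to-leaf propagation order.
    order, parents = breadth_first_order(mst, i_start=0, directed=False)
    return order, parents

order, parents = tree_scan_order(np.random.rand(8, 8, 16).astype(np.float32))
```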
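For the language track, the report quotes LoRA fine-tuning of a 130M-parameter Mamba on the 52k-sample Alpaca set. A minimal sketch using the Hugging Face transformers/peft/datasets stacks follows; the checkpoint id, target module names, and LoRA hyperparameters are assumptions, not the paper's configuration.

```python
# A hedged sketch of LoRA fine-tuning setup on Alpaca-style instruction data.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Stand-in checkpoint for the 130M-parameter Mamba base model (assumption).
model = AutoModelForCausalLM.from_pretrained("state-spaces/mamba-130m-hf")

# Low-rank adapters on the Mamba block projections (module names assumed).
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["in_proj", "out_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable

# The 52k-sample Alpaca instruction-tuning set cited by the paper [58].
alpaca = load_dataset("tatsu-lab/alpaca", split="train")
```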
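Finally, the classification recipe in the experiment-setup row (AdamW, initial learning rate 1×10⁻³, weight decay 0.05, 20-epoch warmup into a cosine schedule over 300 epochs) maps onto a per-step PyTorch scheduler as sketched below. The linear warmup shape and per-step (rather than per-epoch) scheduling are assumptions, since the quote does not specify them.

```python
# A minimal sketch of the reported classification recipe; the linear warmup
# and per-step scheduling are assumptions not stated in the paper excerpt.
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model: torch.nn.Module, steps_per_epoch: int,
                    epochs: int = 300, warmup_epochs: int = 20,
                    base_lr: float = 1e-3, weight_decay: float = 0.05):
    optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:  # linear warmup from 0 to base_lr
            return step / max(1, warmup_steps)
        # Cosine decay from base_lr toward 0 over the remaining steps.
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    return optimizer, LambdaLR(optimizer, lr_lambda)
```

The quoted EMA could be layered on top with, e.g., torch.optim.swa_utils.AveragedModel, and the detection settings (AdamW at 1×10⁻⁴, batch size 16, 1×/3× schedules) correspond to the MMDetection-style configs the paper cites rather than anything this sketch covers.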