Demystify Mamba in Vision: A Linear Attention Perspective

Authors: Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, Gao Huang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "For each design, we meticulously analyze its pros and cons, and empirically evaluate its impact on model performance in vision tasks."
Researcher Affiliation | Collaboration | Dongchen Han (1), Ziyi Wang (1), Zhuofan Xia (1), Yizeng Han (1), Yifan Pu (1), Chunjiang Ge (1), Jun Song (2), Shiji Song (1), Bo Zheng (2), Gao Huang (1); (1) Tsinghua University, (2) Alibaba Group
Pseudocode | No | No sections labeled "Pseudocode" or "Algorithm" are found. The paper describes its methods with equations and diagrams (Figs. 3 and 7).
Open Source Code | Yes | Code is available at https://github.com/LeapLabTHU/MLLA.
Open Datasets | Yes | ImageNet-1K classification [8], COCO object detection [30], and ADE20K semantic segmentation [55].
Dataset Splits | Yes | "The ImageNet-1K dataset comprises 1.28 million training images and 50,000 validation images, encompassing 1,000 classes."
Hardware Specification | Yes | Speed tests were run on an RTX 3090 GPU.
Software Dependencies | No | The paper details its training recipe (AdamW [34] optimizer, cosine learning rate schedule, RandAugment [6], Mixup [53], CutMix [52], random erasing [54], and MESA [11]), but no version numbers for software or libraries are mentioned.
Experiment Setup | Yes | "Specifically, we utilize the AdamW [34] optimizer to train all our models from scratch for 300 epochs. We apply a cosine learning rate decay schedule with a linear warm-up of 20 epochs and a weight decay of 0.05. The total batch size is 4096 and the initial learning rate is set to 4 × 10⁻³. Augmentation and regularization strategies include RandAugment [6], Mixup [53], CutMix [52], and random erasing [54]. In the training of MILA models, MESA [11] is employed to prevent overfitting."
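
The quoted training recipe maps onto a standard PyTorch setup. The following is a minimal sketch, not the authors' released code: the model is a hypothetical placeholder for the MILA/MLLA backbone, learning-rate updates are assumed to happen per epoch, and the augmentation pipeline (RandAugment, Mixup, CutMix, random erasing) and MESA are omitted.

    import math
    import torch

    EPOCHS, WARMUP_EPOCHS = 300, 20     # from the quoted setup
    BASE_LR, WEIGHT_DECAY = 4e-3, 0.05  # initial LR 4 × 10⁻³, weight decay 0.05
    TOTAL_BATCH_SIZE = 4096             # accumulated across devices in practice

    model = torch.nn.Linear(768, 1000)  # placeholder; not the actual backbone

    # AdamW with the stated base learning rate and weight decay.
    optimizer = torch.optim.AdamW(model.parameters(), lr=BASE_LR,
                                  weight_decay=WEIGHT_DECAY)

    def lr_lambda(epoch: int) -> float:
        # Linear warm-up over the first 20 epochs, then cosine decay to zero.
        if epoch < WARMUP_EPOCHS:
            return (epoch + 1) / WARMUP_EPOCHS
        progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    for epoch in range(EPOCHS):
        # ... one epoch over augmented ImageNet-1K batches, with
        # optimizer.zero_grad() / loss.backward() / optimizer.step() ...
        scheduler.step()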