Demystify Mamba in Vision: A Linear Attention Perspective
Authors: Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, Gao Huang
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For each design, we meticulously analyze its pros and cons, and empirically evaluate its impact on model performance in vision tasks. |
| Researcher Affiliation | Collaboration | Dongchen Han¹, Ziyi Wang¹, Zhuofan Xia¹, Yizeng Han¹, Yifan Pu¹, Chunjiang Ge¹, Jun Song², Shiji Song¹, Bo Zheng², Gao Huang¹ (¹Tsinghua University, ²Alibaba Group) |
| Pseudocode | No | No sections labeled "Pseudocode" or "Algorithm" are found. The paper uses equations and diagrams (Figures 3 and 7) to describe methods. |
| Open Source Code | Yes | Code is available at https://github.com/LeapLabTHU/MLLA. |
| Open Datasets | Yes | ImageNet-1K classification [8], COCO object detection [30], and ADE20K semantic segmentation [55]. |
| Dataset Splits | Yes | The ImageNet-1K dataset comprises 1.28 million training images and 50,000 validation images, encompassing 1,000 classes. |
| Hardware Specification | Yes | Speed tests on an RTX 3090 GPU. |
| Software Dependencies | No | Specifically, we utilize the AdamW [34] optimizer to train all our models from scratch for 300 epochs. We apply a cosine learning rate decay schedule... Augmentation and regularization strategies include RandAugment [6], Mixup [53], CutMix [52], and random erasing [54]. In the training of MILA models, MESA [11] is employed to prevent overfitting. No version numbers for software or libraries are mentioned. |
| Experiment Setup | Yes | Specifically, we utilize the AdamW [34] optimizer to train all our models from scratch for 300 epochs. We apply a cosine learning rate decay schedule with a linear warm-up of 20 epochs and a weight decay of 0.05. The total batch size is 4096 and the initial learning rate is set to 4 × 10⁻³. Augmentation and regularization strategies include RandAugment [6], Mixup [53], CutMix [52], and random erasing [54]. In the training of MILA models, MESA [11] is employed to prevent overfitting. |
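
To make the quoted training configuration concrete, the sketch below shows how the stated optimizer, schedule, and augmentations might be wired up with PyTorch and timm. Only the values the paper states (AdamW, 300 epochs, 20-epoch linear warm-up followed by cosine decay, weight decay 0.05, total batch size 4096, initial learning rate 4 × 10⁻³, RandAugment/Mixup/CutMix/random erasing) are taken from the quote; the placeholder model, the RandAugment policy string, the Mixup/CutMix alphas, and the random-erasing probability are assumptions, and MESA is omitted. This is not the authors' released implementation (see https://github.com/LeapLabTHU/MLLA for that).

```python
# Hypothetical sketch of the training recipe quoted in "Experiment Setup".
# Values taken from the quote: AdamW, 300 epochs, 20-epoch linear warm-up,
# cosine decay, weight decay 0.05, total batch size 4096, initial LR 4e-3.
# Everything else (placeholder model, augmentation magnitudes, timm usage)
# is an assumption, NOT the authors' released code.
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
from timm.data import Mixup, create_transform

EPOCHS, WARMUP_EPOCHS = 300, 20
TOTAL_BATCH_SIZE = 4096          # summed over all GPUs
BASE_LR, WEIGHT_DECAY = 4e-3, 0.05

# Placeholder model; the paper trains MILA backbones on ImageNet-1K.
model = nn.Linear(3 * 224 * 224, 1000)

optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY)

# Linear warm-up for 20 epochs, then cosine decay over the remaining 280.
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=WARMUP_EPOCHS),
        CosineAnnealingLR(optimizer, T_max=EPOCHS - WARMUP_EPOCHS),
    ],
    milestones=[WARMUP_EPOCHS],
)

# RandAugment + random erasing via timm; the policy string and erase
# probability are typical DeiT/Swin-style defaults, not values stated
# in the paper.
train_transform = create_transform(
    input_size=224,
    is_training=True,
    auto_augment="rand-m9-mstd0.5-inc1",
    re_prob=0.25,
    interpolation="bicubic",
)

# Mixup / CutMix applied per batch; the alphas are assumed defaults.
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0,
                 label_smoothing=0.1, num_classes=1000)
```

With a schedule like this, `scheduler.step()` would be called once per epoch, and the 4 × 10⁻³ base rate would typically be scaled if the effective batch size differs from 4096.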