Mobile Attention: Mobile-Friendly Linear-Attention for Vision Transformers

Authors: Zhiyu Yao, Jian Wang, Haixu Wu, Jingdong Wang, Mingsheng Long

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By replacing the standard attention of ViTs with Mobile-Attention, our optimized ViTs achieved enhanced model capacity and competitive performance in a range of computer vision tasks. Specifically, we have achieved remarkable reductions in latency on the iPhone 12. Code is available at https://github.com/thuml/MobileAttention.
Researcher Affiliation | Collaboration | This work was done when Zhiyu Yao was an intern at Baidu VIS. (1) School of Software, BNRist, Tsinghua University, Beijing, China; (2) Baidu VIS, Beijing, China. Correspondence to: Mingsheng Long <mingsheng@tsinghua.edu.cn>.
Pseudocode | No | The paper describes procedures and equations in text and mathematical notation but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/thuml/MobileAttention.
Open Datasets | Yes | Image classification uses the ImageNet dataset (Deng et al., 2009) with 1.2 million training and 50K validation images. In order to showcase the generalization capability of Mobile-Attention, we apply our Mobile-Attention mechanism to three popular vision transformers: DeiT (Touvron et al., 2021), PVT-v2 (Wang et al., 2022), and EfficientFormerV2 (Li et al., 2022a), which is a state-of-the-art lightweight transformer model.
Dataset Splits | Yes | Image classification uses the ImageNet dataset (Deng et al., 2009) with 1.2 million training and 50K validation images. In order to showcase the generalization capability of Mobile-Attention, we apply our Mobile-Attention mechanism to three popular vision transformers: DeiT (Touvron et al., 2021), PVT-v2 (Wang et al., 2022), and EfficientFormerV2 (Li et al., 2022a), which is a state-of-the-art lightweight transformer model.
Hardware Specification | Yes | Our models were trained on a cluster of NVIDIA A100 GPUs to ensure optimal performance. Additionally, we measured the inference speed on mobile devices, specifically an iPhone 12 with an A14 Bionic chip running iOS version 15. ... We also tested model latency on a Pixel 6 (Android) CPU.
Software Dependencies | Yes | To implement the ViT-MobiAtt framework, we utilized PyTorch 1.11, following common practices in recent research such as Swin Transformer (Liu et al., 2021) and T2T-ViT (Yuan et al., 2021).
Experiment Setup | Yes | For the classification task, we employ the AdamW optimizer (Loshchilov & Hutter, 2017) and train the model for 300 epochs. We set the batch size to 2048 and the learning rate to 0.001, using a cosine learning rate decay schedule. The resolution of the input image is resized to 224×224. ... We employ the AdamW optimizer (Loshchilov & Hutter, 2017) with an initial learning rate of 0.0002 and train the model for 12 epochs. The input size is set to 1333×800. ... We adopted the AdamW optimizer (Loshchilov & Hutter, 2017) and implemented a polynomial learning rate schedule with a power of 0.9, starting from an initial learning rate of 0.0002. During training, we resized and cropped the input images to 512×512, and for testing on the validation set, we set the shorter side to 512, following common practices in segmentation.
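
The Research Type row summarizes the core change as replacing the standard softmax attention of ViTs with Mobile-Attention, a linear-attention variant. The sketch below is only a rough illustration of what such a swap looks like: it implements a generic kernel feature-map linear attention, not the paper's actual Mobile-Attention formulation, and the class and parameter names (LinearAttention, head_dim, the elu+1 feature map) are illustrative placeholders.

```python
# Hypothetical sketch: a generic linear-attention module of the kind that can
# replace softmax attention in a ViT block. NOT the paper's Mobile-Attention;
# it only shows the O(N) attention pattern and where the swap would happen.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearAttention(nn.Module):
    """softmax(QK^T)V is replaced by phi(Q)(phi(K)^T V), linear in token count N."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                         # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)       # each: (B, heads, N, head_dim)

        # Simple positive feature map; other kernels are possible.
        q, k = F.elu(q) + 1, F.elu(k) + 1

        kv = torch.einsum("bhnd,bhne->bhde", k, v)              # (B, heads, d, d)
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)     # (B, heads, N, d)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 197, 384)                   # (batch, tokens, embed dim)
    print(LinearAttention(dim=384, num_heads=6)(x).shape)  # torch.Size([2, 197, 384])
```

Replacing a ViT block's attention would then amount to assigning such a module in place of the block's existing softmax attention (e.g., block.attn = LinearAttention(dim, num_heads)), which is the kind of drop-in substitution the summary above describes.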
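
The Experiment Setup row quotes the classification recipe: AdamW, 300 epochs, batch size 2048, learning rate 0.001, cosine decay, 224×224 inputs. A minimal PyTorch sketch of that optimizer and schedule follows, with a placeholder model and data loader standing in for the actual ViT-MobiAtt network and the ImageNet pipeline; warmup, weight decay, and augmentation settings are not given in the excerpt and are left out.

```python
# Minimal sketch of the quoted classification recipe: AdamW, lr 0.001,
# cosine decay over 300 epochs, 224x224 inputs. The model and loader below
# are placeholders, not the paper's ViT-MobiAtt or the ImageNet pipeline.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000))           # placeholder model
train_loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,)))]  # dummy batch

epochs = 300
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
criterion = nn.CrossEntropyLoss()

for epoch in range(epochs):
    for images, labels in train_loader:            # images resized to 224x224
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                               # cosine learning-rate decay, once per epoch
```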