MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

Authors: Sachin Mehta, Mohammad Rastegari

ICLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeiT (ViT-based) for a similar number of parameters.
Researcher Affiliation Industry Sachin Mehta (Apple), Mohammad Rastegari (Apple)
Pseudocode Yes Listing 1: PyTorch implementation of multi-scale sampler
Open Source Code Yes Our source code is open-source and available at: https://github.com/apple/ml-cvnets.
Open Datasets Yes We train MobileViT models from scratch on the ImageNet-1k classification dataset (Russakovsky et al., 2015). We finetune MobileViT...on the MS-COCO dataset (Lin et al., 2014)...We integrate MobileViT with DeepLabv3 (Chen et al., 2017). We finetune MobileViT...on the PASCAL VOC 2012 dataset (Everingham et al., 2015).
Dataset Splits Yes The dataset provides 1.28 million and 50 thousand images for training and validation, respectively. The MobileViT networks are trained using PyTorch for 300 epochs on 8 NVIDIA GPUs with an effective batch size of 1,024 images using the AdamW optimizer (Loshchilov & Hutter, 2019), label smoothing cross-entropy loss (smoothing=0.1), and a multi-scale sampler (S = {(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)}).
Hardware Specification Yes The MobileViT networks are trained using PyTorch for 300 epochs on 8 NVIDIA GPUs...Their inference time is then measured (average over 100 iterations) on a mobile device, i.e., iPhone 12...Table 11: Inference time on different devices. iPhone 12 CPU, iPhone 12 Neural Engine, NVIDIA V100 GPU
Software Dependencies No The paper mentions 'PyTorch' but does not specify a version number or other software dependencies with version numbers.
Experiment Setup Yes The MobileViT networks are trained using PyTorch for 300 epochs on 8 NVIDIA GPUs with an effective batch size of 1,024 images using the AdamW optimizer (Loshchilov & Hutter, 2019), label smoothing cross-entropy loss (smoothing=0.1), and a multi-scale sampler (S = {(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)}). The learning rate is increased from 0.0002 to 0.002 for the first 3k iterations and then annealed to 0.0002 using a cosine schedule (Loshchilov & Hutter, 2017). We use L2 weight decay of 0.01. We use basic data augmentation (i.e., random resized cropping and horizontal flipping).
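The multi-scale sampler the paper documents in Listing 1 draws a spatial resolution per batch from the set S and rescales the batch size so that larger images yield proportionally smaller batches. The following is a minimal dependency-free sketch of that idea, not the paper's PyTorch listing; the function name `multi_scale_batches` and the base-resolution/base-batch-size parameters are illustrative assumptions.

```python
import random

# Spatial resolutions quoted from the paper's sampler set S.
RESOLUTIONS = [(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)]

def multi_scale_batches(num_samples, base_batch_size=1024,
                        base_res=(256, 256), resolutions=RESOLUTIONS, seed=0):
    """Yield (resolution, batch_of_indices) pairs for one epoch.

    For each batch a resolution is drawn at random, and the batch size is
    rescaled by the pixel-count ratio so every batch has a roughly constant
    memory footprint: larger images -> fewer images per batch.
    """
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    base_pixels = base_res[0] * base_res[1]
    i = 0
    while i < num_samples:
        h, w = rng.choice(resolutions)
        # Scale the batch size inversely with image area; keep at least 1.
        bsz = max(1, (base_batch_size * base_pixels) // (h * w))
        yield (h, w), indices[i:i + bsz]
        i += bsz
```

In a real training loop each yielded batch would be resized to its drawn resolution before the forward pass; the actual implementation ships as a `Sampler` in the ml-cvnets repository linked above.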