MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
Authors: Sachin Mehta, Mohammad Rastegari
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeiT (ViT-based) for a similar number of parameters. |
| Researcher Affiliation | Industry | Sachin Mehta, Apple; Mohammad Rastegari, Apple |
| Pseudocode | Yes | Listing 1: PyTorch implementation of multi-scale sampler (a sketch of such a sampler appears below the table) |
| Open Source Code | Yes | Our source code is open-source and available at: https://github.com/apple/ml-cvnets. |
| Open Datasets | Yes | We train MobileViT models from scratch on the ImageNet-1k classification dataset (Russakovsky et al., 2015). We finetune MobileViT...on the MS-COCO dataset (Lin et al., 2014)...We integrate MobileViT with DeepLabv3 (Chen et al., 2017). We finetune MobileViT...on the PASCAL VOC 2012 dataset (Everingham et al., 2015). |
| Dataset Splits | Yes | The dataset provides 1.28 million and 50 thousand images for training and validation, respectively. The MobileViT networks are trained using PyTorch for 300 epochs on 8 NVIDIA GPUs with an effective batch size of 1,024 images using AdamW optimizer (Loshchilov & Hutter, 2019), label smoothing cross-entropy loss (smoothing=0.1), and multi-scale sampler (S = {(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)}). |
| Hardware Specification | Yes | The MobileViT networks are trained using PyTorch for 300 epochs on 8 NVIDIA GPUs...Their inference time is then measured (average over 100 iterations) on a mobile device, i.e., iPhone 12...Table 11: Inference time on different devices. iPhone 12 CPU, iPhone 12 Neural Engine, NVIDIA V100 GPU |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not specify a version number or other software dependencies with version numbers. |
| Experiment Setup | Yes | The MobileViT networks are trained using PyTorch for 300 epochs on 8 NVIDIA GPUs with an effective batch size of 1,024 images using AdamW optimizer (Loshchilov & Hutter, 2019), label smoothing cross-entropy loss (smoothing=0.1), and multi-scale sampler (S = {(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)}). The learning rate is increased from 0.0002 to 0.002 for the first 3k iterations and then annealed to 0.0002 using a cosine schedule (Loshchilov & Hutter, 2017). We use L2 weight decay of 0.01. We use basic data augmentation (i.e., random resized cropping and horizontal flipping). (A sketch of this schedule appears below the table.) |
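
For the pseudocode row, here is a minimal PyTorch sketch of a variably-batched multi-scale sampler in the spirit of the paper's Listing 1. The class name `MultiScaleSampler`, the `base_batch` parameter, and the inverse-area batch-scaling rule are illustrative assumptions; the authors' actual implementation is in the ml-cvnets repository.

```python
import random
import torch
from torch.utils.data import Sampler

# Minimal sketch only. The class name, base_batch, and the inverse-area
# batch-size rule are assumptions for illustration; see Listing 1 in the
# paper / https://github.com/apple/ml-cvnets for the authors' version.
class MultiScaleSampler(Sampler):
    def __init__(self, dataset_len, base_batch=128, base_res=(320, 320),
                 scales=((160, 160), (192, 192), (256, 256),
                         (288, 288), (320, 320))):
        self.dataset_len = dataset_len
        # Grow the batch at lower resolutions so per-batch memory use
        # stays roughly constant across scales.
        self.batches = [
            (h, w, int(base_batch * base_res[0] * base_res[1] / (h * w)))
            for (h, w) in scales
        ]

    def __iter__(self):
        order = torch.randperm(self.dataset_len).tolist()
        i = 0
        while i < self.dataset_len:
            h, w, b = random.choice(self.batches)  # one resolution per batch
            # Yield (index, height, width) triples so the dataset or
            # collate_fn can resize each sample to the chosen resolution.
            yield [(idx, h, w) for idx in order[i:i + b]]
            i += b
```

Such a sampler would be passed to a `DataLoader` via `batch_sampler=`, with a dataset whose `__getitem__` accepts the `(index, h, w)` triples and resizes accordingly.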
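Similarly, the learning-rate schedule quoted in the experiment setup row can be written down directly. The constants (warmup from 0.0002 to 0.002 over the first 3k iterations, cosine annealing back to 0.0002) come from the paper; the function name `lr_at` and the `total_iters` argument are assumptions for this sketch.

```python
import math

# Sketch of the schedule described in the setup: linear warmup from
# 2e-4 to 2e-3 over the first 3k iterations, then cosine annealing
# back to 2e-4. Function name and total_iters are assumptions.
def lr_at(iteration, total_iters, warmup_iters=3000,
          lr_min=2e-4, lr_max=2e-3):
    if iteration < warmup_iters:
        # Linear warmup toward the peak learning rate.
        return lr_min + (lr_max - lr_min) * iteration / warmup_iters
    # Cosine anneal over the remaining iterations.
    progress = (iteration - warmup_iters) / max(1, total_iters - warmup_iters)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```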