MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

Authors: Sachin Mehta, Mohammad Rastegari

ICLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeiT (ViT-based) for a similar number of parameters.
Researcher Affiliation Industry Sachin Mehta (Apple), Mohammad Rastegari (Apple)
Pseudocode Yes Listing 1: PyTorch implementation of multi-scale sampler
Open Source Code Yes Our source code is open-source and available at: https://github.com/apple/ml-cvnets.
Open Datasets Yes We train MobileViT models from scratch on the ImageNet-1k classification dataset (Russakovsky et al., 2015). We finetune MobileViT...on the MS-COCO dataset (Lin et al., 2014)...We integrate MobileViT with DeepLabv3 (Chen et al., 2017). We finetune MobileViT...on the PASCAL VOC 2012 dataset (Everingham et al., 2015).
Dataset Splits Yes The dataset provides 1.28 million and 50 thousand images for training and validation, respectively. The MobileViT networks are trained using PyTorch for 300 epochs on 8 NVIDIA GPUs with an effective batch size of 1,024 images using the AdamW optimizer (Loshchilov & Hutter, 2019), label smoothing cross-entropy loss (smoothing=0.1), and a multi-scale sampler (S = {(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)}).
Hardware Specification Yes The MobileViT networks are trained using PyTorch for 300 epochs on 8 NVIDIA GPUs...Their inference time is then measured (average over 100 iterations) on a mobile device, i.e., iPhone 12...Table 11: Inference time on different devices. iPhone 12 CPU, iPhone 12 Neural Engine, NVIDIA V100 GPU
Software Dependencies No The paper mentions 'PyTorch' but does not specify a version number or other software dependencies with version numbers.
Experiment Setup Yes The MobileViT networks are trained using PyTorch for 300 epochs on 8 NVIDIA GPUs with an effective batch size of 1,024 images using the AdamW optimizer (Loshchilov & Hutter, 2019), label smoothing cross-entropy loss (smoothing=0.1), and a multi-scale sampler (S = {(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)}). The learning rate is increased from 0.0002 to 0.002 for the first 3k iterations and then annealed to 0.0002 using a cosine schedule (Loshchilov & Hutter, 2017). We use L2 weight decay of 0.01. We use basic data augmentation (i.e., random resized cropping and horizontal flipping).
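The multi-scale sampler the paper documents in Listing 1 draws a spatial resolution per batch from the set S and rescales the batch size so that larger images yield proportionally smaller batches. The following is a minimal dependency-free sketch of that idea, not the paper's PyTorch listing; the function name `multi_scale_batches` and the base-resolution/base-batch-size parameters are illustrative assumptions.

```python
import random

# Spatial resolutions quoted from the paper's sampler set S.
RESOLUTIONS = [(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)]

def multi_scale_batches(num_samples, base_batch_size=1024,
                        base_res=(256, 256), resolutions=RESOLUTIONS, seed=0):
    """Yield (resolution, batch_of_indices) pairs for one epoch.

    For each batch a resolution is drawn at random, and the batch size is
    rescaled by the pixel-count ratio so every batch has a roughly constant
    memory footprint: larger images -> fewer images per batch.
    """
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    base_pixels = base_res[0] * base_res[1]
    i = 0
    while i < num_samples:
        h, w = rng.choice(resolutions)
        # Scale the batch size inversely with image area; keep at least 1.
        bsz = max(1, (base_batch_size * base_pixels) // (h * w))
        yield (h, w), indices[i:i + bsz]
        i += bsz
```

In a real training loop each yielded batch would be resized to its drawn resolution before the forward pass; the actual implementation ships as a `Sampler` in the ml-cvnets repository linked above.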