MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models

Authors: Chenglin Yang, Siyuan Qiao, Qihang Yu, Xiaoding Yuan, Yukun Zhu, Alan Yuille, Hartwig Adam, Liang-Chieh Chen

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our conceptually simple MOAT networks are surprisingly effective, achieving 89.1% / 81.5% top-1 accuracy on ImageNet-1K / ImageNet-1K-V2 with ImageNet-22K pretraining. Additionally, MOAT can be seamlessly applied to downstream tasks that require large resolution inputs... on COCO object detection, MOAT achieves 59.2% AP^box... and on ADE20K semantic segmentation, MOAT attains 57.6% mIoU... The tiny-MOAT family is also benchmarked on downstream tasks, serving as a baseline for the community.
Researcher Affiliation | Collaboration | The Johns Hopkins University and Google Research
Pseudocode | No | The paper provides formal mathematical representations of blocks (e.g., Eq. 1-9) and architectural diagrams (Figure 1), but it does not include pseudocode or clearly labeled algorithm blocks. (A hedged block sketch is given after this table.)
Open Source Code | Yes | Code is publicly available. Official code in TensorFlow: https://github.com/google-research/deeplab2
Open Datasets | Yes | The ImageNet-1K dataset (Russakovsky et al., 2015) contains 1.2M training images with 1000 classes. We also experiment with pretraining on the larger ImageNet-22K dataset... We train Cascade Mask R-CNN (Cai & Vasconcelos, 2018; He et al., 2017) on the COCO 2017 dataset (Lin et al., 2014)... We experiment with the proposed MOAT models on the ADE20K semantic segmentation dataset (Zhou et al., 2019).
Dataset Splits | Yes | We report top-1 accuracy on the ImageNet-1K validation set, using the last checkpoint. We train MOAT models on ImageNet-1K with resolution 224 for 300 epochs. If pretraining on the larger ImageNet-22K, we use resolution 224 and 90 epochs. Afterwards, the models are fine-tuned on ImageNet-1K for 30 epochs. During fine-tuning, we also experiment with larger resolutions (e.g., 384 and 512). (This schedule is summarized as a config sketch after the table.)
Hardware Specification | Yes | We use 16 TPUv4 cores for training MOAT-{0,1,2} and 32 TPUv4 cores for MOAT-3... We re-implement MOAT with the popular timm (Wightman, 2019) library in PyTorch, and measure the throughput on an Nvidia V100 GPU. (A throughput-benchmark sketch is given after the table.)
Software Dependencies | No | The paper mentions TensorFlow and the timm (Wightman, 2019) library in PyTorch, but it does not provide version numbers for these software components, which a reproducible description requires.
Experiment Setup | Yes | We train MOAT models on ImageNet-1K with resolution 224 for 300 epochs... We employ the typical regularization methods during training, such as label smoothing (Szegedy et al., 2016), RandAugment (Cubuk et al., 2020), MixUp (Zhang et al., 2017), stochastic depth (Huang et al., 2016), and Adam (Kingma & Ba, 2015) with decoupled weight decay (i.e., AdamW (Loshchilov & Hutter, 2019)). See Tab. 10 and Tab. 11 for detailed hyper-parameters. (A recipe sketch using common timm utilities follows the table.)
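
Since the Pseudocode row notes the paper formalizes its blocks in equations rather than pseudocode, here is a minimal PyTorch sketch of one reading of the MOAT block: an inverted-bottleneck MBConv (with a strided depthwise convolution handling optional downsampling) followed by standard multi-head self-attention. Normalization placement, the GELU activations, and the omission of squeeze-and-excitation are assumptions of this sketch, not a transcription of the paper's Eq. 1-9.

    # Hypothetical PyTorch sketch of a MOAT-style block (MBConv, then self-attention).
    import torch
    import torch.nn as nn

    class MBConv(nn.Module):
        """Inverted bottleneck; squeeze-and-excitation omitted (an assumption here)."""
        def __init__(self, dim, expansion=4, stride=1):
            super().__init__()
            hidden = dim * expansion
            self.pre_norm = nn.BatchNorm2d(dim)
            self.expand = nn.Conv2d(dim, hidden, 1, bias=False)
            self.bn1 = nn.BatchNorm2d(hidden)
            self.dwconv = nn.Conv2d(hidden, hidden, 3, stride=stride,
                                    padding=1, groups=hidden, bias=False)
            self.bn2 = nn.BatchNorm2d(hidden)
            self.act = nn.GELU()
            self.project = nn.Conv2d(hidden, dim, 1, bias=False)
            self.use_skip = stride == 1

        def forward(self, x):
            out = self.act(self.bn1(self.expand(self.pre_norm(x))))
            out = self.act(self.bn2(self.dwconv(out)))
            out = self.project(out)
            return x + out if self.use_skip else out

    class MOATBlock(nn.Module):
        def __init__(self, dim, num_heads=8, stride=1):
            super().__init__()
            # The strided depthwise conv inside MBConv performs any downsampling.
            self.mbconv = MBConv(dim, stride=stride)
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, x):                    # x: (B, C, H, W)
            x = self.mbconv(x)
            b, c, h, w = x.shape
            seq = x.flatten(2).transpose(1, 2)   # (B, H*W, C) tokens for attention
            y = self.norm(seq)
            attn_out, _ = self.attn(y, y, y)
            seq = seq + attn_out                 # residual connection around attention
            return seq.transpose(1, 2).reshape(b, c, h, w)

    print(MOATBlock(96)(torch.randn(2, 96, 14, 14)).shape)  # torch.Size([2, 96, 14, 14])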
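The three training stages quoted in the Dataset Splits row can be condensed into a plain config sketch; only the numbers come from the paper, and the field names are invented for illustration.

    # Training stages as reported in the paper; field names are illustrative only.
    imagenet1k_scratch   = dict(dataset="ImageNet-1K",  resolution=224, epochs=300)
    imagenet22k_pretrain = dict(dataset="ImageNet-22K", resolution=224, epochs=90)
    imagenet1k_finetune  = dict(dataset="ImageNet-1K",  epochs=30,
                                resolutions=(224, 384, 512))  # 384/512 also explored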
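The V100 throughput measurement mentioned in the Hardware Specification row can be reproduced in spirit with a short timm benchmark. Note that MOAT is not part of the upstream timm registry, so the model name below is a placeholder for a locally registered re-implementation, and the batch size and iteration counts are arbitrary.

    # Minimal GPU throughput benchmark in the style of the paper's timm measurement.
    import time
    import torch
    import timm

    # "moat_0" is a placeholder: substitute whatever MOAT port you have registered.
    model = timm.create_model("moat_0", pretrained=False).cuda().eval()
    x = torch.randn(32, 3, 224, 224, device="cuda")

    with torch.no_grad():
        for _ in range(10):                  # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(50):                  # timed iterations
            model(x)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

    print(f"throughput: {50 * x.size(0) / elapsed:.1f} images/s")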
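Finally, the regularization recipe listed in the Experiment Setup row maps onto standard timm/PyTorch utilities as sketched below. Every hyper-parameter value is a placeholder rather than the paper's setting (those live in Tab. 10 and Tab. 11), and the model name again assumes a locally registered MOAT port.

    # Sketch of the paper's regularization recipe using common timm/PyTorch utilities.
    # All hyper-parameter values are placeholders, not the paper's settings.
    import torch
    import timm
    from timm.data import Mixup, create_transform

    train_transform = create_transform(
        input_size=224, is_training=True,
        auto_augment="rand-m15-n2",        # RandAugment policy (placeholder magnitude)
    )
    mixup_fn = Mixup(mixup_alpha=0.8, label_smoothing=0.1, num_classes=1000)

    model = timm.create_model(
        "moat_0",                          # placeholder: register your MOAT port first
        drop_path_rate=0.2,                # stochastic depth (placeholder rate)
        num_classes=1000,
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=0.05)
    # When Mixup is active, timm.loss.SoftTargetCrossEntropy is the usual pairing;
    # plain cross-entropy with label smoothing applies when Mixup is disabled.
    criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)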