FasterViT: Fast Vision Transformers with Hierarchical Attention

Authors: Ali Hatamizadeh, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We have extensively validated its effectiveness on various CV tasks including classification, object detection and segmentation. We also show that HAT can be used as a plug-and-play module for existing networks and enhance them. We further demonstrate significantly faster and more accurate performance than competitive counterparts for images with high resolution. Code is available at https://github.com/NVlabs/FasterViT.
Researcher Affiliation | Industry | Ali Hatamizadeh, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov, NVIDIA. {ahatamizadeh, pmolchanov}@nvidia.com
Pseudocode | No | The paper describes the Hierarchical Attention mechanism and other components using mathematical equations and textual explanations of procedures, but it does not include any blocks explicitly labeled as "Pseudocode" or "Algorithm". (A hedged sketch of the hierarchical-attention idea is given below the table.)
Open Source Code | Yes | Code is available at https://github.com/NVlabs/FasterViT.
Open Datasets | Yes | We employ the ImageNet-1K dataset (Deng et al., 2009) for classification, which includes 1.2M and 50K training and validation images. In addition, we use the ImageNet-21K dataset, which has 14M images with 21841 classes, for pretraining. ... We used the MS COCO dataset (Lin et al., 2014) to finetune a Cascade Mask-RCNN network (He et al., 2017) with pretrained FasterViT backbones. ... For semantic segmentation, we employed the ADE20K dataset (Zhou et al., 2017) to finetune an UperNet network (Xiao et al., 2018) with pre-trained FasterViT backbones. (A minimal dataset-loading sketch follows the table.)
Dataset Splits | Yes | We employ the ImageNet-1K dataset (Deng et al., 2009) for classification, which includes 1.2M and 50K training and validation images.
Hardware Specification | Yes | Throughput is measured on an A100 GPU with a batch size of 128. ... We train all FasterViT models using the LAMB optimizer (You et al., 2019) for 300 epochs with a learning rate of 5e-3 and a total batch size of 4096 using 32 A100 GPUs. ... In order to validate the effectiveness of FasterViT on different platforms, we present additional throughput comparisons on different hardware such as NVIDIA V100, NVIDIA TITAN RTX and NVIDIA A6000 GPUs, Jetson Nano and Intel(R) Xeon(R) E5-2698 v4 CPU. (A throughput-measurement sketch follows the table.)
Software Dependencies | Yes | All throughput numbers and insights presented in the main paper were computed using PyTorch v1.13.
Experiment Setup | Yes | We train all FasterViT models using the LAMB optimizer (You et al., 2019) for 300 epochs with a learning rate of 5e-3 and a total batch size of 4096 using 32 A100 GPUs. For data augmentation, we follow the same strategies as in previous efforts (Liu et al., 2022b; 2021). We also use Exponential Moving Average (EMA), which often improves the performance. Further details on training settings can be found in the appendix. For pre-training on ImageNet-21K, we train the models for 90 epochs with a learning rate of 4e-3. In addition, we fine-tune the models for 60 epochs with a learning rate of 7e-5. ... we trained all models with the AdamW (Loshchilov & Hutter, 2017) optimizer with an initial learning rate of 1e-4, a 3x schedule, weight decay of 5e-2 and a total batch size of 16 on 8 A100 GPUs. ... Specifically, we trained all models with the AdamW (Loshchilov & Hutter, 2017) optimizer, using a learning rate of 6e-5, weight decay of 1e-2 and a total batch size of 16 on 8 A100 GPUs. (A sketch of the classification recipe follows the table.)
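
Since the paper conveys Hierarchical Attention through equations rather than pseudocode, the sketch below illustrates the general idea in plain PyTorch: tokens are grouped into windows, each window is summarized by a carrier token, carrier tokens attend to each other globally, and each window then runs local attention over its own tokens plus its carrier. The module name, head count, and mean-pooled carrier are illustrative assumptions; this is not the authors' implementation, which is available at https://github.com/NVlabs/FasterViT.

```python
# Hypothetical sketch of carrier-token ("hierarchical") attention.
# Not the authors' code; see https://github.com/NVlabs/FasterViT.
import torch
import torch.nn as nn


class ToyHierarchicalAttention(nn.Module):
    """Windowed attention with one mean-pooled carrier token per window;
    carriers exchange information globally before local attention runs."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_windows, tokens_per_window, dim)
        b, w, t, d = x.shape
        carriers = x.mean(dim=2)                                  # (b, w, d)
        carriers, _ = self.global_attn(carriers, carriers, carriers)
        tokens = torch.cat([carriers.unsqueeze(2), x], dim=2)     # (b, w, 1+t, d)
        tokens = tokens.reshape(b * w, 1 + t, d)
        tokens, _ = self.local_attn(tokens, tokens, tokens)
        return tokens[:, 1:, :].reshape(b, w, t, d)               # drop carrier slot


if __name__ == "__main__":
    attn = ToyHierarchicalAttention(dim=64)
    out = attn(torch.randn(2, 16, 49, 64))  # 16 windows of 7x7 tokens
    print(out.shape)                        # torch.Size([2, 16, 49, 64])
```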
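For the Open Datasets and Dataset Splits rows, a minimal loading sketch is shown below using torchvision's ImageFolder; the local directory layout is a placeholder, and the expected counts (1.2M training and 50K validation images) come from the quoted text.

```python
# Minimal sketch of loading the stated ImageNet-1K splits with torchvision.
# The /data/imagenet path is a placeholder for a local copy of the dataset.
from torchvision import datasets, transforms

tfm = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("/data/imagenet/train", transform=tfm)  # ~1.2M images
val_set = datasets.ImageFolder("/data/imagenet/val", transform=tfm)      # 50K images
print(len(train_set), len(val_set))
```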
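The Hardware Specification row reports A100 throughput at batch size 128 under PyTorch. A minimal, hypothetical way to measure images/sec on a single GPU is sketched below; the stand-in model, warm-up count, and fp16 setting are assumptions, and the paper's numbers come from its own measurement harness.

```python
# Hypothetical single-GPU throughput measurement (batch size 128, fp16).
import time
import torch
import torchvision

model = torchvision.models.resnet50().cuda().eval().half()  # stand-in model
x = torch.randn(128, 3, 224, 224, device="cuda", dtype=torch.half)

with torch.no_grad():
    for _ in range(10):                 # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    iters = 50
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    elapsed = time.time() - start

print(f"throughput: {128 * iters / elapsed:.1f} images/sec")
```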
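For the Experiment Setup row, the quoted classification recipe (LAMB, learning rate 5e-3, 300 epochs, global batch size 4096 on 32 A100 GPUs, i.e. 128 images per GPU, plus EMA) could be wired up roughly as below. The timm Lamb optimizer, the cosine schedule, the EMA decay of 0.9998, and the placeholder model are assumptions; the authors' actual training scripts are in their repository.

```python
# Hedged sketch of the reported ImageNet-1K classification recipe.
import torch
from timm.optim import Lamb                      # assumed LAMB implementation
from torch.optim.swa_utils import AveragedModel

model = torch.nn.Linear(8, 8)                    # placeholder for a FasterViT model
optimizer = Lamb(model.parameters(), lr=5e-3)    # lr from the quoted setup
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)  # 300 epochs

# EMA of the weights; the 0.9998 decay is an assumed value, not from the paper.
ema_model = AveragedModel(
    model, avg_fn=lambda ema, cur, n: 0.9998 * ema + (1.0 - 0.9998) * cur
)
```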