Global Context Vision Transformers

Authors: Ali Hatamizadeh, Hongxu Yin, Greg Heinrich, Jan Kautz, Pavlo Molchanov

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks. On the ImageNet-1K dataset for classification, the variants of GC ViT with 51M, 90M and 201M parameters achieve 84.3%, 85.0% and 85.7% Top-1 accuracy, respectively, at 224×224 image resolution and without any pre-training, hence surpassing comparably-sized prior art such as CNN-based ConvNeXt and ViT-based MaxViT and Swin Transformer by a large margin. Pre-trained GC ViT backbones in downstream tasks of object detection, instance segmentation, and semantic segmentation using MS COCO and ADE20K datasets outperform prior work consistently.
Researcher Affiliation | Industry | NVIDIA. Correspondence to: Ali Hatamizadeh <ahatamizadeh@nvidia.com>.
Pseudocode | Yes | Algorithm 1: Global Attention Pseudocode

    # Input/output shape: (B*, N, C)
    # B*: aggregated batch size; H: height; W: width; C: dim
    # q_g: global token; F: num attention heads; N: H x W
    def init():
        f = nn.Linear(C, 2*C)
        softmax = nn.Softmax(dim=-1)
    def forward(x, q_g):
        B*, N, C = x.shape
        B, C, h, w = q_g.shape
        kv = f(x).reshape(B*, N, 2, F, C // F)
        kv = kv.permute(2, 0, 3, 1, 4)
        k, v = split(kv, (1, 1), 0)
        q_g = q_g.repeat(1, B* // B, 1, 1)
        q_g = q_g.reshape(B*, F, N, C // F)
        qk = matmul(q_g, k.transpose(-2, -1))
        attn = softmax(qk)
        return matmul(attn, v).reshape(B*, N, C)
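For orientation, the quoted pseudocode can be turned into a minimal runnable PyTorch sketch as below. The module name `GlobalQueryAttention`, the assumed `(B, num_heads, N, head_dim)` layout of the global query, and the attention scaling factor are illustrative choices, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class GlobalQueryAttention(nn.Module):
    """Attention whose queries come from a shared global token (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5   # standard scaling; not shown in the quoted pseudocode
        self.kv = nn.Linear(dim, 2 * dim)    # "f" in the pseudocode: produces keys and values only
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x: torch.Tensor, q_global: torch.Tensor) -> torch.Tensor:
        # x:        (B_agg, N, C) local window tokens; B_agg = batch size * number of windows
        # q_global: (B, num_heads, N, head_dim) precomputed global query tokens (assumed layout)
        B_agg, N, C = x.shape
        B = q_global.shape[0]

        kv = self.kv(x).reshape(B_agg, N, 2, self.num_heads, self.head_dim)
        kv = kv.permute(2, 0, 3, 1, 4)       # (2, B_agg, heads, N, head_dim)
        k, v = kv[0], kv[1]

        # Broadcast the shared global query to every local window in the aggregated batch.
        q = q_global.repeat_interleave(B_agg // B, dim=0)   # (B_agg, heads, N, head_dim)

        attn = self.softmax((q @ k.transpose(-2, -1)) * self.scale)
        out = (attn @ v).transpose(1, 2).reshape(B_agg, N, C)
        return out

# Tiny usage example with illustrative shapes.
attn = GlobalQueryAttention(dim=64, num_heads=4)
x = torch.randn(8, 49, 64)       # 2 images x 4 windows, 7x7 tokens per window
q_g = torch.randn(2, 4, 49, 16)  # one global query per image
print(attn(x, q_g).shape)        # torch.Size([8, 49, 64])
```

The key difference from standard self-attention is that the linear projection produces only keys and values, while the query is the shared global token broadcast to every local window.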
Open Source Code | Yes | Code is available at https://github.com/NVlabs/GCViT.
Open Datasets | Yes | For image classification, we trained and tested our model on the ImageNet-1K dataset (Deng et al., 2009). For object detection and instance segmentation, we trained our model on MS COCO (Lin et al., 2014)... For semantic segmentation, we used the ADE20K dataset (Zhou et al., 2017).
Dataset Splits | Yes | For image classification, we trained and tested our model on the ImageNet-1K dataset (Deng et al., 2009). To allow for a fair comparison, all GC ViT variants are trained by following the training configurations of previous efforts (Liu et al., 2021; Yang et al., 2021b; Chu et al., 2021a). For object detection and instance segmentation, we trained our model on MS COCO (Lin et al., 2014). For semantic segmentation, we used the ADE20K dataset (Zhou et al., 2017).
Hardware Specification | Yes | GC ViT models were trained using four computational nodes with 32 NVIDIA A100 GPUs. Object detection and instance segmentation models, as well as semantic segmentation models, were trained using one computational node with 8 NVIDIA A40 GPUs.
Software Dependencies | No | The paper mentions software such as the `timm` package (Wightman, 2019), `mmdetection` (Chen et al., 2019), and `mmsegmentation` (Contributors, 2020), but does not provide specific version numbers for these packages, which are needed for reproducibility.
Experiment Setup | Yes | Specifically, all models are trained with the AdamW (Kingma & Ba, 2014) optimizer for 300 epochs with an initial learning rate of 0.001, weight decay of 0.05, a cosine decay scheduler, and 20 warm-up and cool-down epochs, respectively. For object detection and instance segmentation, we trained our model on MS COCO (Lin et al., 2014) with DINO (Zhang et al., 2022) and Mask-RCNN (He et al., 2017) heads, using a 3× LR schedule with an initial learning rate of 0.0001, a batch size of 16 and weight decay of 0.05.
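As a rough reference, the quoted classification hyperparameters can be approximated in plain PyTorch as below. This is a sketch only: the authors train through the timm scripts, so the exact warm-up shape and the handling of the cool-down epochs here are assumptions, and the model is a placeholder.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder; in practice a GC ViT variant

epochs, warmup_epochs, cooldown_epochs = 300, 20, 20
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

# Linear warm-up for 20 epochs, then cosine decay over the remaining schedule;
# timm holds the minimum LR during the cool-down epochs, simplified here.
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-2, total_iters=warmup_epochs)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs - warmup_epochs - cooldown_epochs)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

for epoch in range(epochs):
    # ... one training epoch over ImageNet-1K would go here ...
    scheduler.step()
```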