Global Context Vision Transformers
Authors: Ali Hatamizadeh, Hongxu Yin, Greg Heinrich, Jan Kautz, Pavlo Molchanov
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks. On ImageNet-1K dataset for classification, the variants of GC ViT with 51M, 90M and 201M parameters achieve 84.3%, 85.0% and 85.7% Top-1 accuracy, respectively, at 224×224 image resolution and without any pre-training, hence surpassing comparably-sized prior art such as CNN-based ConvNeXt and ViT-based MaxViT and Swin Transformer by a large margin. Pre-trained GC ViT backbones in downstream tasks of object detection, instance segmentation, and semantic segmentation using MS COCO and ADE20K datasets outperform prior work consistently. |
| Researcher Affiliation | Industry | NVIDIA. Correspondence to: Ali Hatamizadeh <ahatamizadeh@nvidia.com>. |
| Pseudocode | Yes | Algorithm 1: Global Attention Pseudocode. # Input/output shape: (B*, N, C); # B*: Aggregated Batch Size; H: Height; # W: Width; C: dim; q_g: Global Token; # F: Num Attention Head; N: H x W. def init(): f = nn.Linear(C, 2*C); softmax = nn.Softmax(dim=-1) def forward(x, q_g): B*, N, C = x.shape; B, C, h, w = q_g.shape; kv = f(x).reshape(B*, N, 2, F, C // F); kv = kv.permute(2, 0, 3, 1, 4); k, v = split(kv, (1, 1), 0); q_g = q_g.repeat(1, B* // B, 1, 1); q_g = q_g.reshape(B*, F, N, C // F); qk = matmul(q_g, k.transpose(-2, -1)); attn = softmax(qk); return matmul(attn, v).reshape(B*, N, C) (a runnable PyTorch sketch of this attention follows the table) |
| Open Source Code | Yes | Code is available at https://github.com/NVlabs/GCViT. |
| Open Datasets | Yes | For image classification, we trained and tested our model on ImageNet-1K dataset (Deng et al., 2009). For object detection and instance segmentation, we trained our model on MS COCO (Lin et al., 2014)... For semantic segmentation, we used the ADE20K dataset (Zhou et al., 2017). |
| Dataset Splits | Yes | For image classification, we trained and tested our model on ImageNet-1K dataset (Deng et al., 2009). To allow for a fair comparison, all GC ViT variants are trained by following training configurations of previous efforts (Liu et al., 2021; Yang et al., 2021b; Chu et al., 2021a). For object detection and instance segmentation, we trained our model on MS COCO (Lin et al., 2014). For semantic segmentation, we used the ADE20K dataset (Zhou et al., 2017). |
| Hardware Specification | Yes | GC ViT models were trained using four computational nodes with 32 NVIDIA A100 GPUs. Object detection and instance segmentation models as well as semantic segmentation models were trained using one computational node with 8 NVIDIA A40 GPUs. |
| Software Dependencies | No | The paper mentions software like `timm package (Wightman, 2019)`, `mmdetection (Chen et al., 2019)`, and `mmsegmentation (Contributors, 2020)`, but does not provide specific version numbers for these packages, which are required for reproducibility. |
| Experiment Setup | Yes | Specifically, all models are trained with the AdamW (Kingma & Ba, 2014) optimizer for 300 epochs with an initial learning rate of 0.001, weight decay of 0.05, a cosine decay scheduler and 20 warm-up and cool-down epochs, respectively. For object detection and instance segmentation, we trained our model on MS COCO (Lin et al., 2014) with DINO and Mask-RCNN (He et al., 2017) heads, using a 3× LR schedule with an initial learning rate of 0.0001, a batch size of 16 and weight decay of 0.05. |
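
To complement the Algorithm 1 pseudocode quoted above, here is a minimal, runnable PyTorch sketch of the global query attention. It is an interpretation, not the authors' released implementation: the class name `GlobalQueryAttention` is ours, the global query `q_g` is assumed to arrive already shaped as `(B, num_heads, N, head_dim)` rather than `(B, C, h, w)`, and the attention scaling and the head transpose before the final reshape are standard-practice additions that the quoted pseudocode omits (it also leaves out the relative position bias and output projection of the full model).

```python
import torch
import torch.nn as nn


class GlobalQueryAttention(nn.Module):
    """Sketch of GC ViT's global self-attention (Algorithm 1).

    Keys and values come from local window tokens x of shape (B_, N, C),
    where B_ is the aggregated batch size (batch x number of windows).
    The query q_g is a precomputed global token shared by all windows of
    the same image, so it is repeated to match B_.
    """

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5      # standard scaling (assumption)
        self.kv = nn.Linear(dim, 2 * dim)       # only K and V are computed from x
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x: torch.Tensor, q_g: torch.Tensor) -> torch.Tensor:
        B_, N, C = x.shape                      # aggregated batch, tokens, channels
        B = q_g.shape[0]                        # true batch size
        # K, V from local tokens: (2, B_, heads, N, head_dim)
        kv = self.kv(x).reshape(B_, N, 2, self.num_heads, self.head_dim)
        kv = kv.permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]
        # Broadcast the global query to every window of its image
        q_g = q_g.repeat(1, B_ // B, 1, 1).reshape(B_, self.num_heads, N, self.head_dim)
        attn = self.softmax(q_g @ k.transpose(-2, -1) * self.scale)
        out = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        return out


# Example shapes: 2 images, 4 windows each, 64 tokens per window, dim 96, 3 heads
attn = GlobalQueryAttention(dim=96, num_heads=3)
x = torch.randn(2 * 4, 64, 96)
q_g = torch.randn(2, 3, 64, 32)
print(attn(x, q_g).shape)  # torch.Size([8, 64, 96])
```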
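
The classification recipe quoted in the Experiment Setup row (AdamW, 300 epochs, base learning rate 0.001, weight decay 0.05, cosine decay with 20 warm-up epochs) can be sketched in plain PyTorch as below. This is an illustrative reconstruction, not the authors' timm-based training script: the function name `build_optimizer_and_scheduler` is ours, the warm-up is assumed to be linear, and the 20 cool-down epochs and any minimum learning rate are not modeled.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer_and_scheduler(model, steps_per_epoch, epochs=300,
                                  warmup_epochs=20, base_lr=1e-3,
                                  weight_decay=0.05):
    """Return an AdamW optimizer and a per-step warm-up + cosine schedule."""
    optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup_steps:
            # Linear warm-up from 0 to the base learning rate (assumption)
            return step / max(1, warmup_steps)
        # Cosine decay from the base learning rate down to 0
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```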