Focal Attention for Long-Range Interactions in Vision Transformers
Authors: Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our Focal Transformer models with a moderate size of 51.1M and a large size of 89.8M achieve 83.6% and 84.0% Top-1 accuracy, respectively, on ImageNet classification at 224×224. When employed as the backbones, Focal Transformers achieve consistent and substantial improvements over the current SoTA Swin Transformers [43] across 6 different object detection methods. Our largest Focal Transformer yields 58.7/59.0 box mAPs and 50.9/51.3 mask mAPs on COCO mini-val/test-dev, and 55.4 mIoU on ADE20K for semantic segmentation, creating new SoTA on three of the most challenging computer vision tasks. Our code is available at: https://github.com/microsoft/Focal-Transformer. |
| Researcher Affiliation | Industry | Jianwei Yang¹, Chunyuan Li¹, Pengchuan Zhang¹, Xiyang Dai², Bin Xiao², Lu Yuan², Jianfeng Gao¹ (¹Microsoft Research at Redmond, ²Microsoft Cloud + AI); {jianwyan,chunyl,penzhan,xidai,bixi,luyuan,jfgao}@microsoft.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at: https://github.com/microsoft/Focal-Transformer. |
| Open Datasets | Yes | We compare different methods on ImageNet-1K [19]. We benchmark our models on object detection with COCO 2017 [42]. We benchmark our methods on ADE20K [83]. |
| Dataset Splits | Yes | All models are trained for 300 epochs with batch size 1024. During training, we crop images randomly to 224×224, while a center crop is used during evaluation on the validation set. All models are trained on the 118k training images and the results are reported on the 5K validation set. (A preprocessing sketch follows the table.) |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used to run its experiments. |
| Software Dependencies | No | The paper mentions using 'AdamW [44]' as an optimizer, but it does not provide specific version numbers for any software dependencies like programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | All models are trained for 300 epochs with batch size 1024. The initial learning rate is set to 10⁻³ with 20 epochs of linear warm-up starting from 10⁻⁵. For optimization, we use AdamW [44] as the optimizer with a cosine learning rate scheduler. The weight decay is set to 0.05 and the maximal gradient norm is clipped to 5.0. We use the same set of data augmentation and regularization strategies used in [55] after excluding random erasing [82], repeated augmentation [4, 34] and exponential moving average (EMA) [48]. The stochastic depth drop rates are set to 0.2, 0.2 and 0.3 for our tiny, small and base models, respectively. (An optimizer configuration sketch follows the table.) |
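The crop policy quoted in the Dataset Splits row maps onto a common torchvision pipeline. The following is a minimal sketch, not the authors' code: `RandomResizedCrop` is assumed for the random training crop, and the horizontal flip, resize-to-256, and ImageNet normalization constants are conventional defaults the paper does not specify.

```python
import torchvision.transforms as T

# Standard ImageNet mean/std; assumed, not stated in the paper.
NORMALIZE = T.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])

# Training: random crop to 224x224, as described in the paper.
train_transform = T.Compose([
    T.RandomResizedCrop(224),   # assumed variant of "crop images randomly"
    T.RandomHorizontalFlip(),   # common default, assumed here
    T.ToTensor(),
    NORMALIZE,
])

# Evaluation: center crop on the validation set, as described in the paper.
eval_transform = T.Compose([
    T.Resize(256),              # assumed resize before the center crop
    T.CenterCrop(224),
    T.ToTensor(),
    NORMALIZE,
])
```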
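The optimization recipe in the Experiment Setup row translates directly into standard PyTorch components. Below is a minimal sketch assuming PyTorch's built-in `LinearLR`/`CosineAnnealingLR` schedulers step once per epoch; the tiny linear model, dummy batch, and placeholder loss are illustrative stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import SequentialLR, LinearLR, CosineAnnealingLR

EPOCHS, WARMUP_EPOCHS = 300, 20     # 300 epochs total, 20 of linear warm-up
BASE_LR, WARMUP_LR = 1e-3, 1e-5     # initial LR 10^-3, warm-up start 10^-5

model = nn.Linear(8, 8)  # stand-in for a Focal Transformer variant
optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=0.05)

# Linear warm-up from 1e-5 to 1e-3 over 20 epochs, then cosine decay.
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=WARMUP_LR / BASE_LR,
                 total_iters=WARMUP_EPOCHS),
        CosineAnnealingLR(optimizer, T_max=EPOCHS - WARMUP_EPOCHS),
    ],
    milestones=[WARMUP_EPOCHS],
)

x = torch.randn(4, 8)  # dummy batch standing in for ImageNet images
for epoch in range(EPOCHS):
    loss = model(x).pow(2).mean()  # placeholder loss for illustration
    optimizer.zero_grad()
    loss.backward()
    # Gradient norm clipped to 5.0, as stated in the paper.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    scheduler.step()  # advance the warm-up/cosine schedule once per epoch
```

In a real run the schedule would cover ImageNet-1K at an effective batch size of 1024; the per-epoch scheduler step is one reasonable reading of the quoted setup, since the paper does not say whether the schedule is stepped per epoch or per iteration.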