Focal Attention for Long-Range Interactions in Vision Transformers
Authors: Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our Focal Transformer models with a moderate size of 51.1M and a large size of 89.8M achieve 83.6% and 84.0% Top-1 accuracy, respectively, on ImageNet classification at 224×224. When employed as the backbones, Focal Transformers achieve consistent and substantial improvements over the current SoTA Swin Transformers [43] across 6 different object detection methods. Our largest Focal Transformer yields 58.7/59.0 box mAPs and 50.9/51.3 mask mAPs on COCO mini-val/test-dev, and 55.4 mIoU on ADE20K for semantic segmentation, creating new SoTA on three of the most challenging computer vision tasks. Our code is available at: https://github.com/microsoft/Focal-Transformer. |
| Researcher Affiliation | Industry | Jianwei Yang¹, Chunyuan Li¹, Pengchuan Zhang¹, Xiyang Dai², Bin Xiao², Lu Yuan², Jianfeng Gao¹ (¹Microsoft Research at Redmond, ²Microsoft Cloud + AI); {jianwyan,chunyl,penzhan,xidai,bixi,luyuan,jfgao}@microsoft.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at: https://github.com/microsoft/Focal-Transformer. |
| Open Datasets | Yes | We compare different methods on ImageNet-1K [19]. We benchmark our models on object detection with COCO 2017 [42]. We benchmark our methods on ADE20K [83]. |
| Dataset Splits | Yes | All models are trained for 300 epochs with batch size 1024. During training, we crop images randomly to 224×224, while a center crop is used during evaluation on the validation set. All models are trained on the 118k training images and the results are reported on the 5K validation set. (A preprocessing sketch follows the table.) |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used to run its experiments. |
| Software Dependencies | No | The paper mentions using 'AdamW [44]' as an optimizer, but it does not provide specific version numbers for any software dependencies like programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | All models are trained for 300 epochs with batch size 1024. The initial learning rate is set to 10⁻³ with 20 epochs of linear warm-up starting from 10⁻⁵. For optimization, we use AdamW [44] as the optimizer with a cosine learning rate scheduler. The weight decay is set to 0.05 and the maximal gradient norm is clipped to 5.0. We use the same set of data augmentation and regularization strategies used in [55] after excluding random erasing [82], repeated augmentation [4, 34] and exponential moving average (EMA) [48]. The stochastic depth drop rates are set to 0.2, 0.2 and 0.3 for our tiny, small and base models, respectively. (An optimizer configuration sketch follows the table.) |
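The crop policy quoted in the Dataset Splits row maps onto a common torchvision pipeline. The following is a minimal sketch, not the authors' code: `RandomResizedCrop` is assumed for the random training crop, and the horizontal flip, resize-to-256, and ImageNet normalization constants are conventional defaults the paper does not specify.

```python
import torchvision.transforms as T

# Standard ImageNet mean/std; assumed, not stated in the paper.
NORMALIZE = T.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])

# Training: random crop to 224x224, as described in the paper.
train_transform = T.Compose([
    T.RandomResizedCrop(224),   # assumed variant of "crop images randomly"
    T.RandomHorizontalFlip(),   # common default, assumed here
    T.ToTensor(),
    NORMALIZE,
])

# Evaluation: center crop on the validation set, as described in the paper.
eval_transform = T.Compose([
    T.Resize(256),              # assumed resize before the center crop
    T.CenterCrop(224),
    T.ToTensor(),
    NORMALIZE,
])
```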
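The optimization recipe in the Experiment Setup row translates directly into standard PyTorch components. Below is a minimal sketch assuming PyTorch's built-in `LinearLR`/`CosineAnnealingLR` schedulers step once per epoch; the tiny linear model, dummy batch, and placeholder loss are illustrative stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import SequentialLR, LinearLR, CosineAnnealingLR

EPOCHS, WARMUP_EPOCHS = 300, 20     # 300 epochs total, 20 of linear warm-up
BASE_LR, WARMUP_LR = 1e-3, 1e-5     # initial LR 10^-3, warm-up start 10^-5

model = nn.Linear(8, 8)  # stand-in for a Focal Transformer variant
optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=0.05)

# Linear warm-up from 1e-5 to 1e-3 over 20 epochs, then cosine decay.
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=WARMUP_LR / BASE_LR,
                 total_iters=WARMUP_EPOCHS),
        CosineAnnealingLR(optimizer, T_max=EPOCHS - WARMUP_EPOCHS),
    ],
    milestones=[WARMUP_EPOCHS],
)

x = torch.randn(4, 8)  # dummy batch standing in for ImageNet images
for epoch in range(EPOCHS):
    loss = model(x).pow(2).mean()  # placeholder loss for illustration
    optimizer.zero_grad()
    loss.backward()
    # Gradient norm clipped to 5.0, as stated in the paper.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    scheduler.step()  # advance the warm-up/cosine schedule once per epoch
```

In a real run the schedule would cover ImageNet-1K at an effective batch size of 1024; the per-epoch scheduler step is one reasonable reading of the quoted setup, since the paper does not say whether the schedule is stepped per epoch or per iteration.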