RegionViT: Regional-to-Local Attention for Vision Transformers
Authors: Chun-Fu Chen, Rameswar Panda, Quanfu Fan
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on four vision tasks, including image classification, object and keypoint detection, semantic segmentation and action recognition, show that our approach outperforms or is on par with state-of-the-art ViT variants including many concurrent works. |
| Researcher Affiliation | Industry | Chun-Fu (Richard) Chen, Rameswar Panda, Quanfu Fan MIT-IBM Watson AI Lab chenrich@us.ibm.com, rpanda@ibm.com, qfan@us.ibm.com |
| Pseudocode | No | The paper describes its methods with mathematical equations and diagrams, but does not include structured pseudocode or algorithm blocks. (A hedged attention sketch is provided after this table.) |
| Open Source Code | Yes | Our source codes and models are available at https://github.com/IBM/RegionViT. |
| Open Datasets | Yes | Datasets. We use ImageNet1K (Deng et al., 2009) (IN1K) and ImageNet21K (Deng et al., 2009) (IN21K) to validate our method. ImageNet1K contains 1.28 million training images and 50k validation images over 1k classes, and ImageNet21K is a large-scale dataset that consists of around 14 million images over 21,841 classes. We use all images for training and then finetune the model on ImageNet1K. |
| Dataset Splits | Yes | Datasets. We use ImageNet1K (Deng et al., 2009) (IN1K) and ImageNet21K (Deng et al., 2009) (IN21K) to validate our method. ImageNet1K contains 1.28 million training images and 50k validation images over 1k classes, and ImageNet21K is a large-scale dataset that consists of around 14 million images over 21,841 classes. We use all images for training and then finetune the model on ImageNet1K. |
| Hardware Specification | No | The paper mentions training models with 32 GPUs and 8 GPUs (Tables A1, A2, A3), but does not specify the GPU model or type, nor other hardware components such as CPUs or memory. |
| Software Dependencies | No | The paper mentions the 'AdamW' optimizer and the 'Detectron2' framework, but does not provide specific version numbers for any software or libraries. |
| Experiment Setup | Yes | We follow DeiT (Touvron et al., 2020) to train our models on IN1K, except that we use batch size 4,096 with a base learning rate of 0.004 and 50 warm-up epochs. We adopt the AdamW (Loshchilov & Hutter, 2019) optimizer with a cosine learning rate scheduler (Loshchilov & Hutter, 2017). We apply Mixup (Zhang et al., 2018), CutMix (Yun et al., 2019), Random Erasing (Zhong et al., 2020), label smoothing (Szegedy et al., 2016), RandAugment (Cubuk et al., 2020) and instance repetition (Hoffer et al., 2020). During training, we randomly crop a 224×224 region; for evaluation, we take a 224×224 center crop after resizing the shorter side to 256. We used a similar setting for IN21K and transfer learning; more details can be found in Section A.1. (Hedged code sketches of this recipe follow the table below.) |
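
Since the paper presents its method only through equations and diagrams, the sketch below illustrates the core regional-to-local attention idea in PyTorch: regional tokens first attend to one another to exchange global information, then each local window attends jointly over its own tokens plus its regional token. This is a minimal sketch under assumed names and shapes (`RegionalToLocalAttention`, the `(B, R, W, C)` layout, and the head count are all illustrative), not the authors' implementation; see https://github.com/IBM/RegionViT for the official code.

```python
# Minimal, illustrative PyTorch sketch of regional-to-local attention.
# NOT the authors' implementation; names and shapes are assumptions.
import torch
import torch.nn as nn


class RegionalToLocalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Regional tokens attend among themselves (global exchange).
        self.regional_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Each window attends over [regional token + its local tokens].
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, regional: torch.Tensor, local: torch.Tensor):
        # regional: (B, R, C), one token per region
        # local:    (B, R, W, C), W local tokens per region
        B, R, W, C = local.shape
        regional, _ = self.regional_attn(regional, regional, regional)
        # Prepend each region's token to its local window, then self-attend.
        tokens = torch.cat([regional.unsqueeze(2), local], dim=2)  # (B, R, 1+W, C)
        tokens = tokens.reshape(B * R, 1 + W, C)
        tokens, _ = self.local_attn(tokens, tokens, tokens)
        tokens = tokens.reshape(B, R, 1 + W, C)
        # Split updated regional tokens from updated local tokens.
        return tokens[:, :, 0], tokens[:, :, 1:]


# Smoke test with assumed sizes: 2 images, 7x7 regions, 7x7 tokens per region.
attn = RegionalToLocalAttention(dim=96)
reg_out, loc_out = attn(torch.randn(2, 49, 96), torch.randn(2, 49, 49, 96))
assert reg_out.shape == (2, 49, 96) and loc_out.shape == (2, 49, 49, 96)
```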
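
The quoted IN1K recipe also translates into a short optimizer/scheduler configuration. Only what the Experiment Setup row states is taken from the paper (batch size 4,096, base learning rate 0.004, 50 warm-up epochs, AdamW, cosine schedule); the 300-epoch total and 0.05 weight decay are assumptions carried over from the DeiT recipe the paper says it follows, and the model below is a placeholder.

```python
# Hedged sketch of the quoted IN1K recipe: AdamW + linear warm-up + cosine decay.
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

TOTAL_EPOCHS = 300   # assumption: DeiT-style schedule length, not stated in the quote
WARMUP_EPOCHS = 50   # stated in the paper
BASE_LR = 4e-3       # stated in the paper (for batch size 4,096)

model = torch.nn.Linear(3 * 224 * 224, 1000)  # placeholder, not RegionViT
optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=0.05)  # 0.05 is an assumption


def lr_lambda(epoch: int) -> float:
    """Linear warm-up for the first 50 epochs, then cosine decay toward 0."""
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


scheduler = LambdaLR(optimizer, lr_lambda)
# Per epoch: train with the augmentations listed above, then call scheduler.step().
```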
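
Finally, the quoted evaluation preprocessing maps directly onto standard torchvision transforms; the normalization statistics are omitted here because the quote does not state them.

```python
from torchvision import transforms

# Evaluation preprocessing as quoted: resize the shorter side to 256,
# then take a 224x224 center crop. (Normalization stats are not quoted,
# so they are omitted from this sketch.)
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```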