RegionViT: Regional-to-Local Attention for Vision Transformers

Authors: Chun-Fu Chen, Rameswar Panda, Quanfu Fan

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on four vision tasks, including image classification, object and keypoint detection, semantic segmentation and action recognition, show that our approach outperforms or is on par with state-of-the-art ViT variants including many concurrent works."
Researcher Affiliation | Industry | Chun-Fu (Richard) Chen, Rameswar Panda, Quanfu Fan; MIT-IBM Watson AI Lab; chenrich@us.ibm.com, rpanda@ibm.com, qfan@us.ibm.com
Pseudocode | No | The paper describes its method with mathematical equations and diagrams but does not include structured pseudocode or algorithm blocks. (A hedged sketch of the regional-to-local attention pattern is given below the table.)
Open Source Code | Yes | "Our source codes and models are available at https://github.com/IBM/RegionViT."
Open Datasets | Yes | "Datasets. We use ImageNet1K (Deng et al., 2009) (IN1K) and ImageNet21K (Deng et al., 2009) (IN21K) to validate our method. ImageNet1K contains 1.28 million training images and 50k validation images over 1k classes, and ImageNet21K is a large-scale dataset that consists of around 14 million images over 21,841 classes. We use all images for training and then finetune the model on ImageNet1K."
Dataset Splits | Yes | Same passage as above: ImageNet1K provides 1.28 million training images and 50k validation images over 1k classes; ImageNet21K is used in full for pretraining before finetuning on ImageNet1K. (A loading sketch for the standard splits is given below the table.)
Hardware Specification | No | The paper mentions training models with 32 GPUs and 8 GPUs (Tables A1, A2, A3), but does not specify the GPU model or any other hardware components such as CPUs or memory.
Software Dependencies | No | The paper names the AdamW optimizer and the Detectron2 framework, but does not provide version numbers for any software or libraries.
Experiment Setup | Yes | "We follow DeiT (Touvron et al., 2020) to train our models on IN1K, except that we use batch size 4,096 with a base learning rate of 0.004 and 50 warm-up epochs. We adopt the AdamW (Loshchilov & Hutter, 2019) optimizer with a cosine learning rate scheduler (Loshchilov & Hutter, 2017). We apply Mixup (Zhang et al., 2018), CutMix (Yun et al., 2019), Random Erasing (Zhong et al., 2020), label smoothing (Szegedy et al., 2016), RandAugment (Cubuk et al., 2020) and instance repetition (Hoffer et al., 2020). During training, we randomly crop a 224×224 region; for evaluation, we take a 224×224 center crop after resizing the shorter side to 256. We used a similar setting for IN21K and transfer learning, and more details can be found in Section A.1." (A training-recipe sketch is given below the table.)
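
Since the paper provides no pseudocode, the following is a minimal sketch of how the regional-to-local attention named in the title might be organized, based only on its high-level description: regional tokens first attend among themselves, then each region's local tokens attend jointly with their own regional token. All tensor shapes, the window area W, and the class name are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class R2LAttentionSketch(nn.Module):
    """Hypothetical sketch of regional-to-local attention (not the authors' code).

    Regional tokens first exchange information globally; then each region's
    local tokens attend together with their own regional token.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.regional_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, regional: torch.Tensor, local: torch.Tensor):
        # regional: (B, R, C), one token per region
        # local:    (B, R, W, C), W local tokens per region (W = window area, assumed)
        B, R, C = regional.shape

        # Step 1: self-attention among regional tokens (global exchange).
        regional, _ = self.regional_attn(regional, regional, regional)

        # Step 2: per-region local attention over the regional token
        # concatenated with that region's local tokens.
        tokens = torch.cat([regional.unsqueeze(2), local], dim=2)  # (B, R, 1+W, C)
        tokens = tokens.flatten(0, 1)                              # (B*R, 1+W, C)
        tokens, _ = self.local_attn(tokens, tokens, tokens)
        tokens = tokens.unflatten(0, (B, R))

        # Split back into updated regional and local tokens.
        return tokens[:, :, 0], tokens[:, :, 1:]
```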
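
The quoted splits match the standard ImageNet-1K layout (about 1.28M training and 50k validation images). A minimal loading sketch with torchvision; the root path is a placeholder, and the evaluation transform follows the resize-then-center-crop procedure quoted in the Experiment Setup row.

```python
from torchvision import datasets, transforms

# Evaluation preprocessing as quoted: shorter side -> 256, then 224x224 center crop.
eval_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# "/data/imagenet" is a placeholder path, not from the paper.
train_set = datasets.ImageNet("/data/imagenet", split="train")
val_set = datasets.ImageNet("/data/imagenet", split="val", transform=eval_tf)

print(len(train_set), len(val_set))  # 1281167, 50000 for standard ImageNet-1K
```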
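
The Experiment Setup row quotes concrete values for the optimizer, schedule, and augmentations. Below is a sketch of that DeiT-style recipe using PyTorch and the timm library. Only the base learning rate 0.004, the 50 warm-up epochs, the cosine schedule, and the list of augmentations come from the paper; the RandAugment magnitude, erase probability, mixup/cutmix alphas, weight decay, total epoch count, and the stand-in model are assumptions.

```python
import torch
from timm.data import create_transform
from timm.data.mixup import Mixup

# Training augmentations: 224x224 random crop, RandAugment, Random Erasing.
# Magnitude string and erase probability are assumed DeiT-style defaults.
train_tf = create_transform(
    input_size=224,
    is_training=True,
    auto_augment="rand-m9-mstd0.5-inc1",
    re_prob=0.25,
)

# Mixup + CutMix with label smoothing; alphas are assumed defaults.
# Instance repetition would additionally use a repeated-augmentation sampler.
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0,
                 label_smoothing=0.1, num_classes=1000)

model = torch.nn.Linear(768, 1000)  # stand-in for a RegionViT variant
optimizer = torch.optim.AdamW(model.parameters(), lr=0.004, weight_decay=0.05)

# Cosine schedule after a 50-epoch warm-up (epoch-level stepping shown;
# 300 total epochs is an assumption, not a quoted value).
epochs, warmup = 300, 50
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    [torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3, total_iters=warmup),
     torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs - warmup)],
    milestones=[warmup],
)
```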