CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention
Authors: Wenxiao Wang, Lu Yao, Long Chen, Binbin Lin, Deng Cai, Xiaofei He, Wei Liu
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that CrossFormer outperforms the other vision transformers on image classification, object detection, instance segmentation, and semantic segmentation tasks. |
| Researcher Affiliation | Collaboration | 1State Key Lab of CAD & CG, Zhejiang University 2Data Platform, Tencent 3Columbia University 4School of Software Technology, Zhejiang University |
| Pseudocode | Yes | Algorithm 1 LSDA code (PyTorch-like) |
| Open Source Code | Yes | 1The code has been released: https://github.com/cheerss/CrossFormer |
| Open Datasets | Yes | The experiments on image classification are done with the ImageNet (Russakovsky et al., 2015) dataset. The experiments on object detection and instance segmentation are both done on the COCO 2017 dataset (Lin et al., 2014). ADE20K (Zhou et al., 2017) is used as the benchmark for semantic segmentation. |
| Dataset Splits | Yes | ImageNet: "The models are trained on 1.28M training images and tested on 50K validation images." COCO: "COCO 2017 dataset (Lin et al., 2014), which contains 118K training and 5K val images." |
| Hardware Specification | Yes | The batch size is 1,024 split on 8 V100 GPUs. |
| Software Dependencies | No | The paper mentions software like AdamW, MMDetection, and MMSegmentation, but does not provide specific version numbers for these or other key software dependencies required for reproduction. |
| Experiment Setup | Yes | In particular, we use an AdamW (Kingma & Ba, 2015) optimizer training for 300 epochs with a cosine decay learning rate scheduler, and 20 epochs of linear warm-up are used. The batch size is 1,024 split on 8 V100 GPUs. An initial learning rate of 0.001 and a weight decay of 0.05 are used. Besides, we use drop path rates of 0.1, 0.2, 0.3, 0.5 for CrossFormer-T, CrossFormer-S, CrossFormer-B, CrossFormer-L, respectively. Further, similar to Swin (Liu et al., 2021b), RandAugment (Cubuk et al., 2020), Mixup (Zhang et al., 2018a), CutMix (Yun et al., 2019), random erasing (Zhong et al., 2020), and stochastic depth (Huang et al., 2016) are used for data augmentation. |
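The learning-rate schedule described in the setup row (20 epochs of linear warm-up into a cosine decay over 300 total epochs, base rate 0.001) can be sketched as a small pure-Python function. This is a minimal sketch of one common warmup-plus-cosine formulation; the paper quotes only the hyperparameters, not the exact schedule equation, so `min_lr` and the per-epoch granularity are assumptions for illustration.

```python
import math

# Hyperparameters quoted in the Experiment Setup row above.
TOTAL_EPOCHS = 300
WARMUP_EPOCHS = 20
BASE_LR = 1e-3

def lr_at_epoch(epoch: int, base_lr: float = BASE_LR,
                warmup: int = WARMUP_EPOCHS, total: int = TOTAL_EPOCHS,
                min_lr: float = 0.0) -> float:
    """Linear warm-up followed by cosine decay (one common formulation;
    min_lr=0 is an assumption, not stated in the paper)."""
    if epoch < warmup:
        # Linear ramp from base_lr/warmup up to base_lr over the warm-up epochs.
        return base_lr * (epoch + 1) / warmup
    # Cosine decay from base_lr down to min_lr over the remaining epochs.
    progress = (epoch - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In a PyTorch training loop this kind of schedule is typically realized by chaining a linear-warmup scheduler with `CosineAnnealingLR`, or by setting each parameter group's `lr` from a function like the one above at every epoch.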