Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention
Authors: Sitong Wu, Tianyi Wu, Haoru Tan, Guodong Guo
AAAI 2022, pp. 2731-2739
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Based on the PS-Attention, we develop a general Vision Transformer backbone with a hierarchical architecture, named Pale Transformer, which achieves 83.4%, 84.3%, and 84.9% Top-1 accuracy with model sizes of 22M, 48M, and 85M, respectively, for 224×224 ImageNet-1K classification, outperforming previous Vision Transformer backbones. For downstream tasks, our Pale Transformer backbone performs better than the recent state-of-the-art CSWin Transformer by a large margin on ADE20K semantic segmentation and COCO object detection & instance segmentation. (A hedged sketch of PS-Attention follows this table.) |
| Researcher Affiliation | Collaboration | Sitong Wu (1,2), Tianyi Wu (1,2), Haoru Tan (3), Guodong Guo (1,2)*. (1) Institute of Deep Learning, Baidu Research, Beijing, China; (2) National Engineering Laboratory for Deep Learning Technology and Application, Beijing, China; (3) School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China |
| Pseudocode | No | The paper presents mathematical formulations and architectural descriptions but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code will be released on https://github.com/BR-IDL/PaddleViT. |
| Open Datasets | Yes | We first compare our Pale Transformer with the state-of-the-art Transformer backbones on ImageNet-1K (Russakovsky et al. 2015) for image classification. To further demonstrate the effectiveness and generalization of our backbone, we conduct experiments on ADE20K (Zhou et al. 2019) for semantic segmentation, and COCO (Lin et al. 2014) for object detection & instance segmentation. |
| Dataset Splits | No | The paper mentions using the 'ImageNet-1K validation set' for evaluation and provides training settings like '300 epochs' and 'total batch size of 1024'. However, it does not specify explicit percentages or counts for training, validation, and test splits, nor does it refer to a predefined split with specific details. |
| Hardware Specification | Yes | All the variants are trained from scratch for 300 epochs on 8 V100 GPUs with a total batch size of 1024. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software components, libraries, or solvers used in the experiments. |
| Experiment Setup | Yes | All the variants are trained from scratch for 300 epochs on 8 V100 GPUs with a total batch size of 1024. Both training and evaluation are conducted with an input size of 224×224 on the ImageNet-1K dataset. ...Note that all variants have the same depth of [2, 2, 16, 2] across the four stages. In each stage of these variants, we set the pale size s_r = s_c = 7, and use the same MLP expansion ratio of R_i = 4. (A hedged configuration sketch follows this table.) |
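To make the PS-Attention description above concrete, here is a minimal PyTorch sketch of the idea: half of the channels attend within groups of `pale_size` interlaced rows, the other half within groups of interlaced columns, so each token sees a pale-shaped (row-plus-column) region. This is our own simplification, not the authors' code (the official implementation lives in the PaddleViT repo linked above); the class name `PaleAttentionSketch`, the use of `nn.MultiheadAttention`, and the omission of details such as relative position bias and conditional position encoding are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class PaleAttentionSketch(nn.Module):
    """Hypothetical simplification of pale-shaped attention (PS-Attention):
    channels are split in half; one half runs self-attention within groups
    of `pale_size` interlaced rows, the other within interlaced columns."""
    def __init__(self, dim, num_heads=8, pale_size=7):
        super().__init__()
        assert dim % 2 == 0 and num_heads % 2 == 0
        self.s = pale_size
        self.row_attn = nn.MultiheadAttention(dim // 2, num_heads // 2, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim // 2, num_heads // 2, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def _group_rows(self, x, s):
        # x: (B, H, W, C). Group g collects the interlaced rows
        # {g, g+G, g+2G, ...} with G = H // s, giving s rows per pale.
        B, H, W, C = x.shape
        G = H // s
        x = x.view(B, s, G, W, C).permute(0, 2, 1, 3, 4)  # (B, G, s, W, C)
        return x.reshape(B * G, s * W, C)                 # s*W tokens per group

    def forward(self, x):
        # x: (B, H, W, C), with H and W divisible by the pale size.
        B, H, W, C = x.shape
        xr, xc = x.chunk(2, dim=-1)          # channel split: row / column branch
        r = self._group_rows(xr, self.s)     # attention within interlaced rows
        r, _ = self.row_attn(r, r, r)
        r = (r.view(B, H // self.s, self.s, W, C // 2)
              .permute(0, 2, 1, 3, 4).reshape(B, H, W, C // 2))
        # Column branch: transpose H and W, reuse the row grouping, undo it.
        c = self._group_rows(xc.transpose(1, 2), self.s)
        c, _ = self.col_attn(c, c, c)
        c = (c.view(B, W // self.s, self.s, H, C // 2)
              .permute(0, 2, 1, 3, 4).reshape(B, W, H, C // 2).transpose(1, 2))
        return self.proj(torch.cat([r, c], dim=-1))

# Smoke test: a 28x28 feature map is divisible by the pale size 7.
attn = PaleAttentionSketch(dim=64, num_heads=8, pale_size=7)
out = attn(torch.randn(2, 28, 28, 64))  # -> (2, 28, 28, 64)
```

The split-and-concatenate design is what makes the pale shape cheap: each branch only attends over s*W or s*H tokens per group rather than the full H*W map.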
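And for the Experiment Setup row, a hypothetical configuration sketch assembled from the settings quoted above. Only the values the report actually cites (300 epochs, total batch size 1024 on 8 V100s, 224×224 ImageNet-1K inputs, depths [2, 2, 16, 2], pale size 7, MLP ratio 4) are grounded; the dict layout and per-GPU batch arithmetic are our own framing.

```python
# Hedged reconstruction of the reported training setup; field names are ours.
train_config = {
    "dataset": "ImageNet-1K",
    "input_size": (224, 224),       # used for both training and evaluation
    "epochs": 300,                  # trained from scratch
    "total_batch_size": 1024,
    "num_gpus": 8,                  # V100s; per-GPU batch = 1024 // 8 = 128
    "depths": [2, 2, 16, 2],        # blocks per stage, shared by all variants
    "pale_size": 7,                 # s_r = s_c = 7 in each stage
    "mlp_ratio": 4,                 # R_i = 4 in each stage
}
```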