ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias
Authors: Yufei Xu, Qiming Zhang, Jing Zhang, Dacheng Tao
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on ImageNet as well as downstream tasks prove the superiority of ViTAE over the baseline transformer and concurrent works. |
| Researcher Affiliation | Collaboration | The University of Sydney, Australia; JD Explore Academy, China |
| Pseudocode | No | The paper describes the architecture and operations using mathematical equations and text, but no explicit pseudocode or algorithm blocks are provided. |
| Open Source Code | No | Source code and pretrained models will be available at code. |
| Open Datasets | Yes | We train and test the proposed ViTAE model on the standard ImageNet [38] dataset, which contains about 1.3 million images and covers 1k classes. |
| Dataset Splits | Yes | We train and test the proposed ViTAE model on the standard ImageNet [38] dataset, which contains about 1.3 million images and covers 1k classes. To validate the effectiveness of the introduced intrinsic IBs in improving data efficiency and training efficiency, we compare our ViTAE with T2T-ViT at different training settings: (a) training them using 20%, 60%, and 100% ImageNet training set for equivalent 100 epochs on the full ImageNet training set, e.g., we employ 5 times epochs when using 20% data for training compared with using 100% data; and (b) training them using the full ImageNet training set for 100, 200, and 300 epochs respectively. (A short sketch of this epoch-scaling rule follows the table.) |
| Hardware Specification | Yes | The results of our models can be found in Table 2, where all the models are trained for 300 epochs on 8 V100 GPUs. |
| Software Dependencies | No | The models are built on PyTorch [57] and TIMM [82]. Specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | Unless explicitly stated, the image size during training is set to 224 × 224. We use the AdamW [48] optimizer with the cosine learning rate scheduler and use the data augmentation strategy exactly the same as T2T [93] for a fair comparison, regarding the training strategies and the size of models. We use a batch size of 512 for training all our models and set the initial learning rate to be 5e-4. The results of our models can be found in Table 2, where all the models are trained for 300 epochs on 8 V100 GPUs. (A hedged sketch of this optimization setup is given after the table.) |
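The epoch-scaling rule quoted in the Dataset Splits row keeps the total optimization budget constant: training on a fraction of ImageNet runs for proportionally more epochs, e.g., 5× the epochs when using 20% of the data. A minimal sketch of that arithmetic, with illustrative names not taken from the authors' code:

```python
# Keep the total step budget equal to 100 epochs over the full ImageNet
# training set by scaling the epoch count inversely with the data fraction.
# Names and rounding are illustrative; the paper only states the 20% case (5x).
FULL_EPOCHS = 100  # equivalent budget: 100 epochs on 100% of the data

def scaled_epochs(data_fraction: float, full_epochs: int = FULL_EPOCHS) -> int:
    """Epoch count that keeps the total number of optimization steps constant."""
    return round(full_epochs / data_fraction)

for fraction in (0.2, 0.6, 1.0):
    print(f"{int(fraction * 100)}% of the data -> {scaled_epochs(fraction)} epochs")
# 20% -> 500 epochs, 60% -> 167 epochs (rounded), 100% -> 100 epochs
```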
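The Experiment Setup row names the main hyperparameters: AdamW, a cosine learning-rate schedule, initial learning rate 5e-4, global batch size 512, 300 epochs, and 224 × 224 inputs. The following is a minimal, self-contained PyTorch sketch of that setup; the toy model, the dummy loss, and the weight-decay value are assumptions, not taken from the paper:

```python
# Hedged sketch of the reported optimization setup: AdamW with an initial
# learning rate of 5e-4 and a cosine learning-rate schedule over 300 epochs.
# The toy Linear model and random loss stand in for the ViTAE backbone and the
# ImageNet pipeline (224x224 crops, batch size 512 on 8 V100 GPUs); the
# weight-decay value is an assumption, as the excerpt does not state it.
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

EPOCHS = 300    # all Table 2 models are trained for 300 epochs
BASE_LR = 5e-4  # initial learning rate stated in the paper

model = nn.Linear(16, 1000)  # toy stand-in for the ViTAE backbone
optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=0.05)  # decay assumed
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)                # cosine schedule

for epoch in range(EPOCHS):
    # stand-in for one training pass over ImageNet
    optimizer.zero_grad()
    loss = model(torch.randn(8, 16)).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()  # anneal the learning rate once per epoch
```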