ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias
Authors: Yufei Xu, Qiming Zhang, Jing Zhang, Dacheng Tao
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on ImageNet as well as downstream tasks prove the superiority of ViTAE over the baseline transformer and concurrent works. |
| Researcher Affiliation | Collaboration | The University of Sydney, Australia; JD Explore Academy, China |
| Pseudocode | No | The paper describes the architecture and operations using mathematical equations and text, but no explicit pseudocode or algorithm blocks are provided. |
| Open Source Code | No | Source code and pretrained models will be available at code. |
| Open Datasets | Yes | We train and test the proposed ViTAE model on the standard ImageNet [38] dataset, which contains about 1.3 million images and covers 1k classes. |
| Dataset Splits | Yes | We train and test the proposed ViTAE model on the standard ImageNet [38] dataset, which contains about 1.3 million images and covers 1k classes. To validate the effectiveness of the introduced intrinsic IBs in improving data efficiency and training efficiency, we compare our ViTAE with T2T-ViT at different training settings: (a) training them using 20%, 60%, and 100% ImageNet training set for equivalent 100 epochs on the full ImageNet training set, e.g., we employ 5 times epochs when using 20% data for training compared with using 100% data; and (b) training them using the full ImageNet training set for 100, 200, and 300 epochs respectively. (A short sketch of this epoch-scaling rule follows the table.) |
| Hardware Specification | Yes | The results of our models can be found in Table 2, where all the models are trained for 300 epochs on 8 V100 GPUs. |
| Software Dependencies | No | The models are built on PyTorch [57] and TIMM [82]. Specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | Unless explicitly stated, the image size during training is set to 224 × 224. We use the AdamW [48] optimizer with the cosine learning rate scheduler and use the data augmentation strategy exactly the same as T2T [93] for a fair comparison, regarding the training strategies and the size of models. We use a batch size of 512 for training all our models and set the initial learning rate to be 5e-4. The results of our models can be found in Table 2, where all the models are trained for 300 epochs on 8 V100 GPUs. (A hedged sketch of this optimization setup is given after the table.) |
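The epoch-scaling rule quoted in the Dataset Splits row keeps the total optimization budget constant: training on a fraction of ImageNet runs for proportionally more epochs, e.g., 5× the epochs when using 20% of the data. A minimal sketch of that arithmetic, with illustrative names not taken from the authors' code:

```python
# Keep the total step budget equal to 100 epochs over the full ImageNet
# training set by scaling the epoch count inversely with the data fraction.
# Names and rounding are illustrative; the paper only states the 20% case (5x).
FULL_EPOCHS = 100  # equivalent budget: 100 epochs on 100% of the data

def scaled_epochs(data_fraction: float, full_epochs: int = FULL_EPOCHS) -> int:
    """Epoch count that keeps the total number of optimization steps constant."""
    return round(full_epochs / data_fraction)

for fraction in (0.2, 0.6, 1.0):
    print(f"{int(fraction * 100)}% of the data -> {scaled_epochs(fraction)} epochs")
# 20% -> 500 epochs, 60% -> 167 epochs (rounded), 100% -> 100 epochs
```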
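The Experiment Setup row names the main hyperparameters: AdamW, a cosine learning-rate schedule, initial learning rate 5e-4, global batch size 512, 300 epochs, and 224 × 224 inputs. The following is a minimal, self-contained PyTorch sketch of that setup; the toy model, the dummy loss, and the weight-decay value are assumptions, not taken from the paper:

```python
# Hedged sketch of the reported optimization setup: AdamW with an initial
# learning rate of 5e-4 and a cosine learning-rate schedule over 300 epochs.
# The toy Linear model and random loss stand in for the ViTAE backbone and the
# ImageNet pipeline (224x224 crops, batch size 512 on 8 V100 GPUs); the
# weight-decay value is an assumption, as the excerpt does not state it.
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

EPOCHS = 300    # all Table 2 models are trained for 300 epochs
BASE_LR = 5e-4  # initial learning rate stated in the paper

model = nn.Linear(16, 1000)  # toy stand-in for the ViTAE backbone
optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=0.05)  # decay assumed
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)                # cosine schedule

for epoch in range(EPOCHS):
    # stand-in for one training pass over ImageNet
    optimizer.zero_grad()
    loss = model(torch.randn(8, 16)).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()  # anneal the learning rate once per epoch
```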