All Tokens Matter: Token Labeling for Training Better Vision Transformers

Authors: Zi-Hang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, Jiashi Feng

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that token labeling can clearly and consistently improve the performance of various ViT models across a wide spectrum. For a vision transformer with 26M learnable parameters serving as an example, with token labeling, the model can achieve 84.4% Top-1 accuracy on ImageNet. The result can be further increased to 86.4% by slightly scaling the model size up to 150M, delivering the minimal-sized model among previous models (250M+) reaching 86%. We also show that token labeling can clearly improve the generalization capability of the pretrained models on downstream tasks with dense prediction, such as semantic segmentation. Our code and model are publicly available at https://github.com/zihangJiang/TokenLabeling.
Researcher Affiliation | Collaboration | Zihang Jiang (1), Qibin Hou (2,1), Li Yuan (3), Daquan Zhou (1), Yujun Shi (1), Xiaojie Jin (4), Anran Wang (4), Jiashi Feng (4); 1: National University of Singapore, 2: Nankai University, 3: Peking University, 4: ByteDance
Pseudocode | No | The paper describes the token labeling method conceptually and mathematically in Section 3, but it does not include any structured pseudocode or algorithm blocks.
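Since the paper provides no algorithm block, a minimal sketch of the training objective it describes in Section 3 may help orient readers. This is a reconstruction, not the authors' code: the tensor names and shapes are assumptions, and the auxiliary weight `beta` stands in for the paper's balancing coefficient (0.5 here is illustrative).

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, target):
    # Cross-entropy against a soft target distribution (the dense score map).
    return torch.sum(-target * F.log_softmax(logits, dim=-1), dim=-1).mean()

def token_labeling_loss(cls_logits, patch_logits, cls_target, token_targets, beta=0.5):
    """Reconstruction of the token labeling objective (hypothetical names).

    cls_logits:    (B, C)    prediction from the class token
    patch_logits:  (B, N, C) per-token predictions from the output patch tokens
    cls_target:    (B, C)    image-level label (one-hot or mixed)
    token_targets: (B, N, C) location-specific soft labels from the machine annotator
    """
    cls_loss = soft_cross_entropy(cls_logits, cls_target)
    aux_loss = soft_cross_entropy(
        patch_logits.reshape(-1, patch_logits.size(-1)),
        token_targets.reshape(-1, token_targets.size(-1)),
    )
    return cls_loss + beta * aux_loss
```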
Open Source Code | Yes | Our code and model are publicly available at https://github.com/zihangJiang/TokenLabeling.
Open Datasets | Yes | We evaluate our method on the ImageNet [13] dataset. All experiments are built and conducted upon PyTorch [29] and the timm [42] library. We follow the standard training schedule and train our models on the ImageNet dataset for 300 epochs. ... We use the NFNet-F6 [3] trained on ImageNet with an 86.3% Top-1 accuracy as the machine annotator to generate dense score maps for the ImageNet dataset, yielding a 1000-dimensional score map for each image for training.
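The dense score maps can be generated offline by running the pretrained annotator fully convolutionally and keeping only the top-scoring classes per spatial position, in the ReLabel style. The sketch below is an assumption-laden illustration, not the authors' pipeline: the timm model name, the `forward_features`/`head.fc` attribute access, and the top-5 truncation are our choices.

```python
import torch
import timm

# Illustrative sketch: apply a pretrained classifier's final linear layer
# at every spatial position of its feature map to get a 1000-way score map.
annotator = timm.create_model('dm_nfnet_f6', pretrained=True).eval()

@torch.no_grad()
def dense_score_map(images, num_keep=5):
    feats = annotator.forward_features(images)   # (B, D, H, W) feature map
    fc = annotator.head.fc                       # attribute names may vary by timm version
    scores = torch.einsum('bdhw,cd->bchw', feats, fc.weight)
    scores = scores + fc.bias[None, :, None, None]
    probs = scores.softmax(dim=1)                # (B, 1000, H, W)
    # Store only the top-k classes per position to keep the maps compact.
    return probs.topk(num_keep, dim=1)           # (values, class indices)
```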
Dataset Splits | Yes | Results in terms of DeiT-S/LV-ViT-S Top-1 accuracy and training time for our token labeling, online knowledge distillation, and ReLabel [49] are listed in Table 4, with the number of utilized tokens also included for clear comparison. ... We run experiments on the widely-used ADE20K [58] dataset. ADE20K contains 25K images in total, including 20K images for training, 2K images for validation and 3K images for test, covering 150 different foreground categories.
Hardware Specification | Yes | Training Time (8 V100): 63 hrs
Software Dependencies | Yes | All experiments are built and conducted upon PyTorch [29] and the timm [42] library. We use the AdamW optimizer [27]... We use the NFNet-F6 [3] trained on ImageNet with an 86.3% Top-1 accuracy as the machine annotator... We take both FCN [26] and UperNet [44] as our segmentation frameworks and use the mmseg toolbox to implement.
Experiment Setup | Yes | We follow the standard training schedule and train our models on the ImageNet dataset for 300 epochs. Besides normal augmentations like CutOut [57] and RandAug [10], we also explore the effect of applying MixUp [52] and CutMix [48] together with our proposed token labeling. ... For optimization, by default, we use the AdamW optimizer [27] with a linear learning rate scaling strategy lr = 10^-3 × batch_size / 640 and a 5 × 10^-2 weight decay rate. For Dropout regularization, we observe that for small models, using Dropout hurts the performance. ... As a result, we do not apply Dropout [32] and use Stochastic Depth [23] instead.
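Putting the quoted hyperparameters together, the optimizer setup reduces to a few lines. A minimal sketch assuming a timm-style model: the model name and drop_path_rate are illustrative placeholders (the paper trains its own LV-ViT variants), while the learning rate rule and weight decay are the values quoted above.

```python
import timm
import torch

batch_size = 1024  # placeholder; plug in the actual global batch size
model = timm.create_model(
    'vit_small_patch16_224',  # placeholder for an LV-ViT variant
    drop_rate=0.0,            # no Dropout, per the paper's observation for small models
    drop_path_rate=0.1,       # Stochastic Depth instead (rate is illustrative)
)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3 * batch_size / 640,  # linear LR scaling: lr = 10^-3 * batch_size / 640
    weight_decay=5e-2,
)
```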