Scaled ReLU Matters for Training Vision Transformers

Authors: Pichao Wang, Xue Wang, Hao Luo, Jingkai Zhou, Zhipeng Zhou, Fan Wang, Hao Li, Rong Jin (pp. 2495-2503)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify, both theoretically and empirically, that scaled ReLU in conv-stem not only improves training stabilization, but also increases the diversity of patch tokens, thus boosting peak performance with a large margin via adding few parameters and flops. In addition, extensive experiments are conducted to demonstrate that previous ViTs are far from being well trained, further showing that ViTs have great potential to be a better substitute of CNNs. (See the conv-stem sketch below the table.)
Researcher Affiliation | Industry | Alibaba Group {pichao.wang,xue.w}@alibaba-inc.com
Pseudocode | Yes | The detailed configurations are shown in Algorithm 1 of supplemental material. ... the full implementation is shown in supplemental material of Algorithm 2
Open Source Code | No | The paper does not provide concrete access to source code (a specific repository link, an explicit code-release statement, or code in supplementary materials) for the methodology described in this paper.
Open Datasets | Yes | The ImageNet1k (Russakovsky et al. 2015) is adopted for standard training and validation. It contains 1.3 million images in the training set and 50K images in the validation set, covering 1000 object classes. ... We fine-tune the DINO-S/16 shown in Table 2 on Market1501 (Zheng et al. 2015) and MSMT17 (Wei et al. 2018) datasets.
Dataset Splits | Yes | The ImageNet1k (Russakovsky et al. 2015) is adopted for standard training and validation. It contains 1.3 million images in the training set and 50K images in the validation set, covering 1000 object classes.
Hardware Specification | No | The paper mentions 'The batch size is 1024 for 8 GPUs' but does not provide specific GPU models, CPU models, or other detailed hardware specifications for running its experiments.
Software Dependencies | No | The paper mentions the 'AdamW optimizer' and 'SAM optimizer' but does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiments.
Experiment Setup | Yes | The batch size is 1024 for 8 GPUs, and the results are as shown in Table 1. From the table we can see that the conv-stem based model can cope with a more volatile training environment: with the patchify-stem, ViT_P cannot support a larger learning rate (1e-3) using the AdamW optimizer... For DeiT and VOLO, we follow the official implementation and training settings, only modifying the parameters listed in the head of Table 2; for DINO, we follow the training settings for 100 epochs and show the linear evaluation results as top-1 accuracy. All models are trained with the baseline learning rate (1.6e-3) and a larger learning rate (5e-2). (See the optimizer sketch below the table.)
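The "scaled ReLU in conv-stem" finding quoted in the Research Type row refers to a convolutional stem whose ReLU activations are preceded by normalization, in place of the single 16x16 patchify convolution of a standard ViT. Below is a minimal PyTorch sketch under stated assumptions; the channel widths, number of stride-2 stages, and embedding dimension are illustrative placeholders, not the authors' configuration.

# A minimal sketch (an assumption, not the authors' released code) of a conv-stem
# with "scaled ReLU": each 3x3 stride-2 convolution is followed by BatchNorm,
# which rescales the activations fed into the ReLU, before a final 1x1 projection
# to the token dimension.
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    def __init__(self, in_chans=3, embed_dim=384):
        super().__init__()
        widths = [48, 96, 192, 384]          # assumed channel progression
        layers, prev = [], in_chans
        for w in widths:
            layers += [
                nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1, bias=False),
                nn.BatchNorm2d(w),           # normalization preceding the ReLU ("scaled ReLU")
                nn.ReLU(inplace=True),
            ]
            prev = w
        layers.append(nn.Conv2d(prev, embed_dim, kernel_size=1))  # project to token dim
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)                     # (B, embed_dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, N, embed_dim) patch tokens

tokens = ConvStem()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                          # torch.Size([1, 196, 384])

With four stride-2 stages the stem reduces a 224x224 input by a factor of 16, so the flattened output matches the 196-token grid produced by a standard 16x16 patchify stem.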
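The Experiment Setup row quotes a global batch size of 1024 on 8 GPUs with the AdamW optimizer and baseline/larger learning rates of 1.6e-3 and 5e-2. A minimal sketch of that optimizer configuration is shown below; the stand-in model, weight decay, and cosine schedule are placeholders for details the quoted text does not report.

# A minimal sketch (assumptions, not the official training script) of the
# optimizer setup described in the Experiment Setup row.
import torch

model = torch.nn.Linear(384, 1000)           # stand-in for the ViT classification head
global_batch, num_gpus = 1024, 8
per_gpu_batch = global_batch // num_gpus     # 128 images per GPU

base_lr, large_lr = 1.6e-3, 5e-2             # baseline vs. the "larger" learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)  # weight decay assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)      # schedule assumed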