Scaled ReLU Matters for Training Vision Transformers
Authors: Pichao Wang, Xue Wang, Hao Luo, Jingkai Zhou, Zhipeng Zhou, Fan Wang, Hao Li, Rong Jin
AAAI 2022, pp. 2495-2503 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify, both theoretically and empirically, that scaled ReLU in the conv-stem not only improves training stabilization, but also increases the diversity of patch tokens, thus boosting peak performance by a large margin while adding only a few parameters and FLOPs. In addition, extensive experiments are conducted to demonstrate that previous ViTs are far from being well trained, further showing that ViTs have great potential to be a better substitute for CNNs. |
| Researcher Affiliation | Industry | Alibaba Group {pichao.wang,xue.w}@alibaba-inc.com |
| Pseudocode | Yes | The detailed configurations are shown in Algorithm 1 of the supplemental material. ... the full implementation is shown in Algorithm 2 of the supplemental material. *(A hedged sketch of such a conv-stem is given after the table.)* |
| Open Source Code | No | The paper does not provide concrete access to source code (specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described in this paper. |
| Open Datasets | Yes | The ImageNet1k (Russakovsky et al. 2015) is adopted for standard training and validation. It contains 1.3 million images in the training set and 50K images in the validation set, covering 1000 object classes. ... We fine-tune the DINO-S/16 shown in Table 2 on Market1501 (Zheng et al. 2015) and MSMT17 (Wei et al. 2018) datasets. |
| Dataset Splits | Yes | The ImageNet1k (Russakovsky et al. 2015) is adopted for standard training and validation. It contains 1.3 million images in the training set and 50K images in the validation set, covering 1000 object classes. |
| Hardware Specification | No | The paper mentions 'The batchsize is 1024 for 8 GPUs' but does not provide specific GPU models, CPU models, or other detailed computer specifications for running its experiments. |
| Software Dependencies | No | The paper mentions the 'AdamW optimizer' and 'SAM optimizer' but does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment. |
| Experiment Setup | Yes | The batch size is 1024 for 8 GPUs, and the results are as shown in Table 1. From the table we can see that the conv-stem based model copes with a more volatile training environment: with the patchify-stem, ViT_P cannot support a larger learning rate (1e-3) using the AdamW optimizer... For DeiT and VOLO, we follow the official implementation and training settings, only modifying the parameters listed in the head of Table 2; for DINO, we follow the training settings for 100 epochs and show the linear evaluation results as top-1 accuracy. All models are trained with the baseline learning rate (1.6e-3) and a larger learning rate (5e-2). *(A hedged optimizer-setup sketch follows the table.)* |
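
The conv-stem referenced in the Pseudocode row replaces the ViT's 16x16 patchify stem with a stack of strided convolutions, each followed by BatchNorm and ReLU (the "scaled ReLU" of the title). Below is a minimal PyTorch sketch of such a stem; the number of layers, channel widths, and embedding dimension are illustrative assumptions, not the configuration from the paper's supplemental Algorithms 1 and 2.

```python
# Minimal sketch of a conv-stem with "scaled ReLU" (BatchNorm + ReLU),
# standing in for a 16x16 patchify stem. Layer counts and channel widths
# are assumptions for illustration only.
import torch
import torch.nn as nn


class ConvStem(nn.Module):
    """Maps a 224x224 image to a 14x14 grid of patch tokens via strided convs."""

    def __init__(self, in_chans=3, embed_dim=384, channels=(48, 96, 192, 384)):
        super().__init__()
        layers = []
        prev = in_chans
        for ch in channels:
            layers += [
                nn.Conv2d(prev, ch, kernel_size=3, stride=2, padding=1, bias=False),
                nn.BatchNorm2d(ch),    # BN supplies the learnable scaling ...
                nn.ReLU(inplace=True), # ... and ReLU the non-linearity ("scaled ReLU")
            ]
            prev = ch
        # 1x1 projection to the transformer embedding dimension
        layers.append(nn.Conv2d(prev, embed_dim, kernel_size=1))
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)                     # (B, embed_dim, 14, 14) for 224x224 input
        return x.flatten(2).transpose(1, 2)  # (B, 196, embed_dim) patch tokens


if __name__ == "__main__":
    tokens = ConvStem()(torch.randn(2, 3, 224, 224))
    print(tokens.shape)  # torch.Size([2, 196, 384])
```

Four stride-2 convolutions reduce a 224x224 input by a factor of 16, so the token grid matches what a 16x16 patchify stem would produce.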
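The Experiment Setup row quotes an effective batch size of 1024 over 8 GPUs (128 images per GPU) and AdamW with a 1e-3 learning rate. The sketch below shows one way to configure that optimizer; the weight decay, epoch count, and cosine schedule are common DeiT-style defaults assumed for illustration, not values confirmed by the report.

```python
# Sketch of the optimizer setup implied by the quoted experiment description.
# Weight decay, epochs, and the cosine schedule are assumed defaults.
import torch

GLOBAL_BATCH = 1024
NUM_GPUS = 8
PER_GPU_BATCH = GLOBAL_BATCH // NUM_GPUS  # 128 images per GPU


def build_optimizer(model, lr=1e-3, weight_decay=0.05, epochs=300):
    """AdamW plus a cosine learning-rate schedule over the training run."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```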