Scaled ReLU Matters for Training Vision Transformers

Authors: Pichao Wang, Xue Wang, Hao Luo, Jingkai Zhou, Zhipeng Zhou, Fan Wang, Hao Li, Rong Jin (pp. 2495-2503)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify, both theoretically and empirically, that scaled ReLU in conv-stem not only improves training stabilization, but also increases the diversity of patch tokens, thus boosting peak performance with a large margin via adding few parameters and flops. In addition, extensive experiments are conducted to demonstrate that previous ViTs are far from being well trained, further showing that ViTs have great potential to be a better substitute of CNNs. (See the conv-stem sketch below the table.)
Researcher Affiliation | Industry | Alibaba Group {pichao.wang,xue.w}@alibaba-inc.com
Pseudocode | Yes | The detailed configurations are shown in Algorithm 1 of supplemental material. ... the full implementation is shown in supplemental material of Algorithm 2
Open Source Code | No | The paper does not provide concrete access to source code (a specific repository link, an explicit code-release statement, or code in supplementary materials) for the methodology described in this paper.
Open Datasets | Yes | The ImageNet1k (Russakovsky et al. 2015) is adopted for standard training and validation. It contains 1.3 million images in the training set and 50K images in the validation set, covering 1000 object classes. ... We fine-tune the DINO-S/16 shown in Table 2 on Market1501 (Zheng et al. 2015) and MSMT17 (Wei et al. 2018) datasets.
Dataset Splits | Yes | The ImageNet1k (Russakovsky et al. 2015) is adopted for standard training and validation. It contains 1.3 million images in the training set and 50K images in the validation set, covering 1000 object classes.
Hardware Specification | No | The paper mentions 'The batch size is 1024 for 8 GPUs' but does not provide specific GPU models, CPU models, or other detailed hardware specifications for running its experiments.
Software Dependencies | No | The paper mentions the 'AdamW optimizer' and 'SAM optimizer' but does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiments.
Experiment Setup | Yes | The batch size is 1024 for 8 GPUs, and the results are as shown in Table 1. From the table we can see that the conv-stem based model can cope with a more volatile training environment: with the patchify-stem, ViT_P cannot support a larger learning rate (1e-3) using the AdamW optimizer... For DeiT and VOLO, we follow the official implementation and training settings, only modifying the parameters listed in the head of Table 2; for DINO, we follow the training settings for 100 epochs and show the linear evaluation results as top-1 accuracy. All models are trained with the baseline learning rate (1.6e-3) and a larger learning rate (5e-2). (See the optimizer sketch below the table.)
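The "scaled ReLU in conv-stem" finding quoted in the Research Type row refers to a convolutional stem whose ReLU activations are preceded by normalization, in place of the single 16x16 patchify convolution of a standard ViT. Below is a minimal PyTorch sketch under stated assumptions; the channel widths, number of stride-2 stages, and embedding dimension are illustrative placeholders, not the authors' configuration.

# A minimal sketch (an assumption, not the authors' released code) of a conv-stem
# with "scaled ReLU": each 3x3 stride-2 convolution is followed by BatchNorm,
# which rescales the activations fed into the ReLU, before a final 1x1 projection
# to the token dimension.
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    def __init__(self, in_chans=3, embed_dim=384):
        super().__init__()
        widths = [48, 96, 192, 384]          # assumed channel progression
        layers, prev = [], in_chans
        for w in widths:
            layers += [
                nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1, bias=False),
                nn.BatchNorm2d(w),           # normalization preceding the ReLU ("scaled ReLU")
                nn.ReLU(inplace=True),
            ]
            prev = w
        layers.append(nn.Conv2d(prev, embed_dim, kernel_size=1))  # project to token dim
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)                     # (B, embed_dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, N, embed_dim) patch tokens

tokens = ConvStem()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                          # torch.Size([1, 196, 384])

With four stride-2 stages the stem reduces a 224x224 input by a factor of 16, so the flattened output matches the 196-token grid produced by a standard 16x16 patchify stem.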
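The Experiment Setup row quotes a global batch size of 1024 on 8 GPUs with the AdamW optimizer and baseline/larger learning rates of 1.6e-3 and 5e-2. A minimal sketch of that optimizer configuration is shown below; the stand-in model, weight decay, and cosine schedule are placeholders for details the quoted text does not report.

# A minimal sketch (assumptions, not the official training script) of the
# optimizer setup described in the Experiment Setup row.
import torch

model = torch.nn.Linear(384, 1000)           # stand-in for the ViT classification head
global_batch, num_gpus = 1024, 8
per_gpu_batch = global_batch // num_gpus     # 128 images per GPU

base_lr, large_lr = 1.6e-3, 5e-2             # baseline vs. the "larger" learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)  # weight decay assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)      # schedule assumed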