Dynamic Token Normalization Improves Vision Transformers
Authors: Wenqi Shao, Yixiao Ge, Zhaoyang Zhang, Xuyuan Xu, Xiaogang Wang, Ying Shan, Ping Luo
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that the transformer equipped with DTN consistently outperforms the baseline model with minimal extra parameters and computational overhead. |
| Researcher Affiliation | Collaboration | (1) The Chinese University of Hong Kong; (2) ARC Lab, Tencent PCG; (3) AI Technology Center of Tencent Video; (4) The University of Hong Kong |
| Pseudocode | Yes | Algorithm 1 Forward pass of DTN. |
| Open Source Code | Yes | Codes will be made public at https://github.com/wqshao126/DTN. |
| Open Datasets | Yes | Extensive experiments such as image classification on ImageNet (Russakovsky et al., 2015), robustness on ImageNet-C (Hendrycks & Dietterich, 2019), self-supervised pre-training on ViTs (Caron et al., 2021), and ListOps on Long-Range Arena (Tay et al., 2021) show that DTN can achieve better performance with minimal extra parameters and a marginal increase in computational overhead compared to existing approaches. |
| Dataset Splits | Yes | ImageNet. We evaluate the performance of our proposed DTN using ViT models with different sizes on ImageNet, which consists of 1.28M training images and 50k validation images. |
| Hardware Specification | No | The paper mentions training on "all GPUs" but does not specify exact GPU/CPU models or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions using frameworks like MMDetection but does not provide specific version numbers for any key software components or libraries. |
| Experiment Setup | Yes | We train ViT with our proposed DTN by following the training framework of DeiT (Touvron et al., 2021), where the ViT models are trained with a total batch size of 1024 on all GPUs. We use Adam optimizer with a momentum of 0.9 and weight decay of 0.05. The cosine learning schedule is adopted with the initial learning rate of 0.0005. |
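
The experiment setup quoted above maps onto a standard PyTorch training configuration. The sketch below is a minimal, hypothetical illustration of those hyperparameters (total batch size 1024, Adam with beta1 = 0.9, weight decay 0.05, cosine schedule from an initial learning rate of 5e-4); the stand-in `vit_b_16` backbone, the 300-epoch horizon, and the omission of DTN layers and the DeiT augmentation pipeline are assumptions, not details confirmed by the paper.

```python
# Hypothetical sketch of the quoted optimization setup; not the authors' code.
import torch
from torchvision.models import vit_b_16

model = vit_b_16()  # placeholder backbone; in the paper, DTN replaces the normalization layers

epochs = 300             # assumed DeiT-style schedule length; not stated in the quote
total_batch_size = 1024  # summed over all GPUs, as quoted

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=5e-4,                # initial learning rate from the quote
    betas=(0.9, 0.999),     # beta1 = 0.9 matches the quoted "momentum of 0.9"
    weight_decay=0.05,      # weight decay from the quote
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one pass over ImageNet with an effective batch size of 1024 ...
    scheduler.step()
```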