Conditional Positional Encodings for Vision Transformers

Authors: Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Chunhua Shen

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that CPVT has visually similar attention maps compared to those with learned positional encodings and delivers outperforming results. Our Code is available at: https://git.io/CPVT. ... 4 EXPERIMENTS: Datasets. Following DeiT (Touvron et al., 2020), we use ILSVRC-2012 ImageNet dataset (Deng et al., 2009) with 1K classes and 1.3M images to train all our models. We report the results on the validation set with 50K images. ... Table 2. Direct evaluation on other resolutions without fine-tuning. ... Table 4. Comparison with ConvNets and Transformers on ImageNet and ImageNet Real (Beyer et al., 2020).
Researcher Affiliation | Collaboration | Xiangxiang Chu (1), Zhi Tian (1), Bo Zhang (1), Xinlong Wang (2), Chunhua Shen (3); (1) Meituan Inc., (2) Beijing Academy of AI, (3) Zhejiang University, China; {chuxiangxiang, tianzhi02, zhangbo97}@meituan.com, xinlong.wang96@gmail.com, chunhua@me.com
Pseudocode | Yes | Algorithm 1: PyTorch snippet of PEG. import torch; import torch.nn as nn; class VisionTransformer: ... (a hedged sketch of the PEG module is given after this table)
Open Source Code | Yes | Our Code is available at: https://git.io/CPVT.
Open Datasets | Yes | Datasets. Following DeiT (Touvron et al., 2020), we use ILSVRC-2012 ImageNet dataset (Deng et al., 2009) with 1K classes and 1.3M images to train all our models.
Dataset Splits | Yes | Datasets. Following DeiT (Touvron et al., 2020), we use ILSVRC-2012 ImageNet dataset (Deng et al., 2009) with 1K classes and 1.3M images to train all our models. We report the results on the validation set with 50K images.
Hardware Specification | Yes | All experiments in this paper are performed on Tesla V100 machines. Training the tiny model for 300 epochs takes about 1.3 days on a single node with 8 V100 GPU cards. CPVT-S and CPVT-B take about 1.6 and 2.5 days, respectively. ... All the models (except for CPVT-B) are trained for 300 epochs with a global batch size of 2048 on Tesla V100 machines using AdamW optimizer (Loshchilov & Hutter, 2019).
Software Dependencies | No | Built on PEG, we present Conditional Positional Encoding Vision Transformer (CPVT). We demonstrate that CPVT has visually similar attention maps compared to those with learned positional encodings and delivers outperforming results. Our Code is available at: https://git.io/CPVT. ... AdamW optimizer (Loshchilov & Hutter, 2019). ... effortlessly implemented by the modern deep learning frameworks (Paszke et al., 2019; Abadi et al., 2016; Chen et al., 2015)...
Experiment Setup | Yes | Training details: All the models (except for CPVT-B) are trained for 300 epochs with a global batch size of 2048 on Tesla V100 machines using AdamW optimizer (Loshchilov & Hutter, 2019). We do not tune the hyper-parameters and strictly comply with the settings in DeiT (Touvron et al., 2020). The learning rate is scaled with the formula lr_scale = 0.0005 × BatchSize_global / 512. The detailed hyper-parameters are in Appendix B.2. (A worked example of this scaling rule is given after this table.)
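
For reference, here is a minimal PyTorch sketch of the PEG module that Algorithm 1 refers to, as noted in the Pseudocode row above. It follows the paper's description of PEG as a depthwise convolution with zero padding applied to the tokens reshaped into their 2-D layout, with the result added back to the input; the class name PEG and the default arguments here are illustrative rather than the paper's exact code, which is available at https://git.io/CPVT.

    import torch
    import torch.nn as nn

    class PEG(nn.Module):
        # Positional Encoding Generator: a depthwise 3x3 convolution over the
        # 2-D token map; the zero padding makes the output position-dependent.
        def __init__(self, dim=192, kernel_size=3):
            super().__init__()
            self.proj = nn.Conv2d(dim, dim, kernel_size, stride=1,
                                  padding=kernel_size // 2, groups=dim)

        def forward(self, x, H, W):
            # x: (B, N, C) patch tokens (class token excluded), with N = H * W
            B, N, C = x.shape
            feat = x.transpose(1, 2).reshape(B, C, H, W)   # tokens -> 2-D feature map
            feat = self.proj(feat) + feat                  # conv output plus identity shortcut
            return feat.flatten(2).transpose(1, 2)         # back to (B, N, C)

    # Example: tokens from a 224x224 image split into 16x16 patches (a 14x14 grid).
    tokens = torch.randn(2, 14 * 14, 192)
    out = PEG(dim=192)(tokens, H=14, W=14)   # same shape, now carrying positional information

In CPVT, such a module is inserted after an early encoder block, so the positional encoding is generated conditionally on the input and extends naturally to other input resolutions.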
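
As a quick sanity check of the learning-rate rule quoted in the Experiment Setup row, the following tiny Python helper (the name scaled_lr is ours, purely illustrative) evaluates lr_scale = 0.0005 × BatchSize_global / 512:

    def scaled_lr(global_batch_size, base_lr=0.0005):
        # DeiT-style linear scaling of the learning rate with the global batch size.
        return base_lr * global_batch_size / 512

    print(scaled_lr(2048))  # 0.002 for the global batch size of 2048 reported above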