Conditional Positional Encodings for Vision Transformers

Authors: Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Chunhua Shen

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that CPVT has visually similar attention maps compared to those with learned positional encodings and delivers outperforming results. Our Code is available at: https://git.io/CPVT. ... 4 EXPERIMENTS: Datasets. Following DeiT (Touvron et al., 2020), we use ILSVRC-2012 ImageNet dataset (Deng et al., 2009) with 1K classes and 1.3M images to train all our models. We report the results on the validation set with 50K images. ... Table 2. Direct evaluation on other resolutions without fine-tuning. ... Table 4. Comparison with ConvNets and Transformers on ImageNet and ImageNet Real (Beyer et al., 2020).
Researcher Affiliation | Collaboration | Xiangxiang Chu (1), Zhi Tian (1), Bo Zhang (1), Xinlong Wang (2), Chunhua Shen (3); (1) Meituan Inc., (2) Beijing Academy of AI, (3) Zhejiang University, China; {chuxiangxiang, tianzhi02, zhangbo97}@meituan.com, xinlong.wang96@gmail.com, chunhua@me.com
Pseudocode | Yes | Algorithm 1: PyTorch snippet of PEG. import torch; import torch.nn as nn; class VisionTransformer: ... (a hedged sketch of the PEG module is given after this table)
Open Source Code | Yes | Our Code is available at: https://git.io/CPVT.
Open Datasets | Yes | Datasets. Following DeiT (Touvron et al., 2020), we use ILSVRC-2012 ImageNet dataset (Deng et al., 2009) with 1K classes and 1.3M images to train all our models.
Dataset Splits | Yes | Datasets. Following DeiT (Touvron et al., 2020), we use ILSVRC-2012 ImageNet dataset (Deng et al., 2009) with 1K classes and 1.3M images to train all our models. We report the results on the validation set with 50K images.
Hardware Specification | Yes | All experiments in this paper are performed on Tesla V100 machines. Training the tiny model for 300 epochs takes about 1.3 days on a single node with 8 V100 GPU cards. CPVT-S and CPVT-B take about 1.6 and 2.5 days, respectively. ... All the models (except for CPVT-B) are trained for 300 epochs with a global batch size of 2048 on Tesla V100 machines using AdamW optimizer (Loshchilov & Hutter, 2019).
Software Dependencies | No | Built on PEG, we present Conditional Positional Encoding Vision Transformer (CPVT). We demonstrate that CPVT has visually similar attention maps compared to those with learned positional encodings and delivers outperforming results. Our Code is available at: https://git.io/CPVT. ... AdamW optimizer (Loshchilov & Hutter, 2019). ... effortlessly implemented by the modern deep learning frameworks (Paszke et al., 2019; Abadi et al., 2016; Chen et al., 2015)...
Experiment Setup | Yes | Training details: All the models (except for CPVT-B) are trained for 300 epochs with a global batch size of 2048 on Tesla V100 machines using AdamW optimizer (Loshchilov & Hutter, 2019). We do not tune the hyper-parameters and strictly comply with the settings in DeiT (Touvron et al., 2020). The learning rate is scaled with the formula lr_scale = 0.0005 × BatchSize_global / 512. The detailed hyper-parameters are in Appendix B.2. (A worked example of this scaling rule is given after this table.)
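
For reference, here is a minimal PyTorch sketch of the PEG module that Algorithm 1 refers to, as noted in the Pseudocode row above. It follows the paper's description of PEG as a depthwise convolution with zero padding applied to the tokens reshaped into their 2-D layout, with the result added back to the input; the class name PEG and the default arguments here are illustrative rather than the paper's exact code, which is available at https://git.io/CPVT.

    import torch
    import torch.nn as nn

    class PEG(nn.Module):
        # Positional Encoding Generator: a depthwise 3x3 convolution over the
        # 2-D token map; the zero padding makes the output position-dependent.
        def __init__(self, dim=192, kernel_size=3):
            super().__init__()
            self.proj = nn.Conv2d(dim, dim, kernel_size, stride=1,
                                  padding=kernel_size // 2, groups=dim)

        def forward(self, x, H, W):
            # x: (B, N, C) patch tokens (class token excluded), with N = H * W
            B, N, C = x.shape
            feat = x.transpose(1, 2).reshape(B, C, H, W)   # tokens -> 2-D feature map
            feat = self.proj(feat) + feat                  # conv output plus identity shortcut
            return feat.flatten(2).transpose(1, 2)         # back to (B, N, C)

    # Example: tokens from a 224x224 image split into 16x16 patches (a 14x14 grid).
    tokens = torch.randn(2, 14 * 14, 192)
    out = PEG(dim=192)(tokens, H=14, W=14)   # same shape, now carrying positional information

In CPVT, such a module is inserted after an early encoder block, so the positional encoding is generated conditionally on the input and extends naturally to other input resolutions.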
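
As a quick sanity check of the learning-rate rule quoted in the Experiment Setup row, the following tiny Python helper (the name scaled_lr is ours, purely illustrative) evaluates lr_scale = 0.0005 × BatchSize_global / 512:

    def scaled_lr(global_batch_size, base_lr=0.0005):
        # DeiT-style linear scaling of the learning rate with the global batch size.
        return base_lr * global_batch_size / 512

    print(scaled_lr(2048))  # 0.002 for the global batch size of 2048 reported above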