Discrete Representations Strengthen Vision Transformer Robustness

Authors: Chengzhi Mao, Lu Jiang, Mostafa Dehghani, Carl Vondrick, Rahul Sukthankar, Irfan Essa

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that adding discrete representation on four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks while maintaining the performance on ImageNet.
Researcher Affiliation | Collaboration | Chengzhi Mao2, Lu Jiang1, Mostafa Dehghani1, Carl Vondrick2, Rahul Sukthankar1, Irfan Essa1,3; 1 Google Research; 2 Computer Science, Columbia University; 3 School of Interactive Computing, Georgia Institute of Technology
Pseudocode | Yes | (a) Pseudo JAX code:

    # x: input image mini-batch; pixel_embed: pixel
    # embeddings of x. vqgan.encoder and codebook are
    # initialized from the pretraining.
    import jax
    import jax.numpy as jnp
    discrete_token = jax.lax.stop_gradient(vqgan.encoder(x))
    discrete_embed = jnp.dot(discrete_token, codebook)
    tokens = jnp.concatenate(
        [discrete_embed, pixel_embed], axis=2)
    predictions = TransformerEncoder(tokens)
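The pseudocode above can be exercised end to end with stand-in components. The sketch below is a minimal runnable version of the token-concatenation step only; `vqgan_encode` is a hypothetical placeholder for the pretrained VQGAN encoder, and the shapes (batch 2, 196 positions, 1024-entry codebook, 512-dim embeddings) are illustrative assumptions, not values from the paper.

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)

codebook = jax.random.normal(k1, (1024, 512))        # code index -> embedding vector
pixel_embed = jax.random.normal(k2, (2, 196, 512))   # standard ViT patch embeddings

def vqgan_encode(x):
    # Placeholder for the frozen VQGAN encoder: the real encoder maps
    # images to one-hot code indices; here we fake it with random logits.
    logits = jax.random.normal(k3, (2, 196, 1024))
    return jax.nn.one_hot(jnp.argmax(logits, axis=-1), 1024)

# stop_gradient keeps the pretrained encoder and codebook frozen.
discrete_token = jax.lax.stop_gradient(vqgan_encode(None))
discrete_embed = jnp.dot(discrete_token, codebook)   # (2, 196, 512)

# Concatenate along the embedding axis to form the transformer input tokens.
tokens = jnp.concatenate([discrete_embed, pixel_embed], axis=2)
```

Concatenating along `axis=2` doubles the embedding width per position (here 512 + 512 = 1024) rather than adding extra token positions, which matches the `dim=2` argument in the paper's pseudocode.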
Open Source Code | No | We will release our model and code to assist in comparisons and to support other researchers in reproducing our experiments and results.
Open Datasets | Yes | Datasets. In the experiments, we train all the models, including ours, on ImageNet 2012 or ImageNet-21K under the same training settings, where we use identical training data, batch size, and learning rate schedule, etc. ImageNet (Deng et al., 2009) is the standard validation set of ILSVRC2012.
Dataset Splits | Yes | ImageNet (Deng et al., 2009) is the standard validation set of ILSVRC2012.
Hardware Specification | Yes | Our batch size is 4096 and models are trained on 64 TPU cores.
Software Dependencies | No | We implement our model in JAX and optimize all models with Adam (Kingma & Ba, 2014).
Experiment Setup | Yes | Unless specified otherwise, the input images are resized to 224x224, trained with a batch size of 4,096, with a weight decay of 0.1. We use a linear learning rate warmup and cosine decay. On ImageNet, the models are trained for 300 epochs... We start from a learning rate of 0.001 and train for 300 epochs. We use linear warmup for 10k iterations and then cosine annealing for the learning rate schedule. We use 224x224 resolution for the input image. We use RandAug with hyper-parameter (2, 15) and mixup with α = 0.5, and we apply a dropout rate of 0.1 and stochastic block dropout of 0.1.
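The stated schedule (linear warmup for 10k iterations to a peak of 0.001, then cosine annealing) can be sketched as a plain function. The total step count below is an illustrative assumption; the paper only fixes the peak rate, warmup length, and epoch count.

```python
import math

def lr_schedule(step, peak_lr=1e-3, warmup_steps=10_000, total_steps=200_000):
    """Linear warmup to peak_lr, then cosine decay to zero.

    total_steps is a hypothetical value chosen for illustration.
    """
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warmup period.
        return peak_lr * step / warmup_steps
    # Cosine annealing from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For example, `lr_schedule(5_000)` gives half the peak rate mid-warmup, `lr_schedule(10_000)` returns the full peak of 0.001, and the rate decays smoothly to zero by the final step.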