DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Authors: Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we will demonstrate the superiority of the proposed DynamicViT through extensive experiments.
Researcher Affiliation | Academia | Yongming Rao¹ Wenliang Zhao¹ Benlin Liu²,³ Jiwen Lu¹ Jie Zhou¹ Cho-Jui Hsieh²; ¹Tsinghua University, ²UCLA, ³University of Washington
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/raoyongming/DynamicViT.
Open Datasets | Yes | We illustrate the effectiveness of our method on ImageNet using DeiT [25] and LV-ViT [16] as backbone. ... In all of our experiments, we fix the number of sparsification stages S = 3 and apply the target keeping ratio ρ as a geometric sequence [ρ, ρ², ρ³], where ρ ranges from (0, 1).
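The geometric keeping-ratio schedule quoted above can be sketched in a few lines. This is a minimal illustration only: the 196-token count (DeiT-S patch tokens at 224×224) and the floor-rounding rule are assumptions for the example, not details taken from the paper's released code.

```python
# Sketch of the geometric token-keeping schedule [rho, rho^2, rho^3].
# Assumes 196 patch tokens (DeiT-S at 224x224 input) and floor rounding;
# both are illustrative assumptions, not the official implementation.

def tokens_kept_per_stage(num_tokens: int, rho: float, num_stages: int = 3):
    """Return how many tokens survive after each sparsification stage."""
    return [int(num_tokens * rho ** (s + 1)) for s in range(num_stages)]

print(tokens_kept_per_stage(196, 0.7))  # -> [137, 96, 67]
```

With ρ = 0.7, roughly a third of the tokens are pruned at each of the three stages, which is how the method trades accuracy for throughput.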
Dataset Splits | Yes | We then use the DynamicViT to generate the decisions for all the images in the ImageNet validation set and compute the keep probability of each token in all three stages, as shown in Figure 6. ... (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
Hardware Specification | Yes | The throughput is measured on a single NVIDIA RTX 3090 GPU with batch size fixed to 32. ... All of our models are trained on a single machine with 8 GPUs.
Software Dependencies | No | The paper mentions training on GPUs but does not specify software dependencies such as deep learning framework versions (e.g., PyTorch, TensorFlow) or CUDA versions.
Experiment Setup | Yes | We use the pre-trained vision transformer models to initialize the backbone models and jointly train the whole model for 30 epochs. We set the learning rate of the prediction module to (batch size / 1024) × 0.001 and use a 0.01× learning rate for the backbone model. We fix the weights of the backbone models in the first 5 epochs. All of our models are trained on a single machine with 8 GPUs.
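The learning-rate rule quoted above can be sketched as follows. This reading, that the prediction-module rate scales linearly with batch size from a base of 0.001 at batch size 1024, and that the backbone uses a 0.01× multiple of it, is an assumption based on the quoted text, not the released training script.

```python
# Sketch of the quoted learning-rate rule (assumed linear batch-size
# scaling): predictor lr = (batch_size / 1024) * 0.001, and the backbone
# is trained at 0.01x the predictor rate.

def learning_rates(batch_size: int):
    """Return (prediction_module_lr, backbone_lr) for a given batch size."""
    predictor_lr = batch_size / 1024 * 0.001
    backbone_lr = 0.01 * predictor_lr
    return predictor_lr, backbone_lr

print(learning_rates(1024))
```

At the paper's batch size of 1024 this gives 0.001 for the prediction module and 1e-5 for the backbone, consistent with fine-tuning a pre-trained backbone gently while training the new module from scratch.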