DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Authors: Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we will demonstrate the superiority of the proposed DynamicViT through extensive experiments.
Researcher Affiliation | Academia | Yongming Rao¹ Wenliang Zhao¹ Benlin Liu²,³ Jiwen Lu¹ Jie Zhou¹ Cho-Jui Hsieh²; ¹Tsinghua University, ²UCLA, ³University of Washington
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/raoyongming/DynamicViT.
Open Datasets | Yes | We illustrate the effectiveness of our method on ImageNet using DeiT [25] and LV-ViT [16] as backbone. ... In all of our experiments, we fix the number of sparsification stages S = 3 and apply the target keeping ratio ρ as a geometric sequence [ρ, ρ², ρ³], where ρ ranges from (0, 1).
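The geometric keeping-ratio schedule quoted above can be sketched in a few lines. This is a minimal illustration only: the 196-token count (DeiT-S patch tokens at 224×224) and the floor-rounding rule are assumptions for the example, not details taken from the paper's released code.

```python
# Sketch of the geometric token-keeping schedule [rho, rho^2, rho^3].
# Assumes 196 patch tokens (DeiT-S at 224x224 input) and floor rounding;
# both are illustrative assumptions, not the official implementation.

def tokens_kept_per_stage(num_tokens: int, rho: float, num_stages: int = 3):
    """Return how many tokens survive after each sparsification stage."""
    return [int(num_tokens * rho ** (s + 1)) for s in range(num_stages)]

print(tokens_kept_per_stage(196, 0.7))  # -> [137, 96, 67]
```

With ρ = 0.7, roughly a third of the tokens are pruned at each of the three stages, which is how the method trades accuracy for throughput.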
Dataset Splits | Yes | We then use the DynamicViT to generate the decisions for all the images in the ImageNet validation set and compute the keep probability of each token in all three stages, as shown in Figure 6. ... (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
Hardware Specification | Yes | The throughput is measured on a single NVIDIA RTX 3090 GPU with batch size fixed to 32. ... All of our models are trained on a single machine with 8 GPUs.
Software Dependencies | No | The paper mentions training on GPUs but does not specify software dependencies such as deep learning framework versions (e.g., PyTorch, TensorFlow) or CUDA versions.
Experiment Setup | Yes | We use the pre-trained vision transformer models to initialize the backbone models and jointly train the whole model for 30 epochs. We set the learning rate of the prediction module to (batch size / 1024) × 0.001 and use a 0.01× learning rate for the backbone model. We fix the weights of the backbone models in the first 5 epochs. All of our models are trained on a single machine with 8 GPUs.
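The learning-rate rule quoted above can be sketched as follows. This reading, that the prediction-module rate scales linearly with batch size from a base of 0.001 at batch size 1024, and that the backbone uses a 0.01× multiple of it, is an assumption based on the quoted text, not the released training script.

```python
# Sketch of the quoted learning-rate rule (assumed linear batch-size
# scaling): predictor lr = (batch_size / 1024) * 0.001, and the backbone
# is trained at 0.01x the predictor rate.

def learning_rates(batch_size: int):
    """Return (prediction_module_lr, backbone_lr) for a given batch size."""
    predictor_lr = batch_size / 1024 * 0.001
    backbone_lr = 0.01 * predictor_lr
    return predictor_lr, backbone_lr

print(learning_rates(1024))
```

At the paper's batch size of 1024 this gives 0.001 for the prediction module and 1e-5 for the backbone, consistent with fine-tuning a pre-trained backbone gently while training the new module from scratch.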