DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
Authors: Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we will demonstrate the superiority of the proposed DynamicViT through extensive experiments. |
| Researcher Affiliation | Academia | Yongming Rao1 Wenliang Zhao1 Benlin Liu2,3 Jiwen Lu1 Jie Zhou1 Cho-Jui Hsieh2 1 Tsinghua University 2 UCLA 3 University of Washington |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/raoyongming/DynamicViT. |
| Open Datasets | Yes | We illustrate the effectiveness of our method on ImageNet using DeiT [25] and LV-ViT [16] as backbone. ... In all of our experiments, we fix the number of sparsification stages S = 3 and apply the target keeping ratio ρ as a geometric sequence [ρ, ρ², ρ³] where ρ ranges from (0, 1). |
| Dataset Splits | Yes | We then use the DynamicViT to generate the decisions for all the images in the ImageNet validation set and compute the keep probability of each token in all three stages, as shown in Figure 6. ... (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] |
| Hardware Specification | Yes | The throughput is measured on a single NVIDIA RTX 3090 GPU with batch size fixed to 32. ... All of our models are trained on a single machine with 8 GPUs. |
| Software Dependencies | No | The paper mentions training on GPUs but does not specify software dependencies like specific versions of deep learning frameworks (e.g., PyTorch, TensorFlow) or CUDA versions. |
| Experiment Setup | Yes | We use the pre-trained vision transformer models to initialize the backbone models and jointly train the whole model for 30 epochs. We set the learning rate of the prediction module to (batch size / 1024) × 0.001 and use 0.01× learning rate for the backbone model. We fix the weights of the backbone models in the first 5 epochs. All of our models are trained on a single machine with 8 GPUs. |
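The quoted setup implies two small computations a reproducer needs: the geometric keeping-ratio schedule [ρ, ρ², ρ³] across the S = 3 sparsification stages, and the linearly scaled learning rate for the prediction module. A minimal sketch of both, assuming the standard linear LR-scaling rule (the function names here are illustrative, not from the authors' code):

```python
def keep_ratios(rho: float, num_stages: int = 3) -> list[float]:
    """Geometric keeping-ratio schedule [rho, rho^2, ..., rho^S]."""
    assert 0.0 < rho < 1.0, "target keeping ratio must lie in (0, 1)"
    return [rho ** (s + 1) for s in range(num_stages)]

def predictor_lr(batch_size: int, base_lr: float = 0.001) -> float:
    """Prediction-module LR, assumed linear scaling: batch_size / 1024 * base_lr."""
    return batch_size / 1024 * base_lr

# Example: rho = 0.7 keeps roughly 70%, 49%, 34% of tokens per stage;
# the backbone uses a 0.01x multiple of the predictor's learning rate.
ratios = keep_ratios(0.7)
backbone_lr = 0.01 * predictor_lr(1024)
```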