Chasing Sparsity in Vision Transformers: An End-to-End Exploration
Authors: Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, Zhangyang Wang
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive results on ImageNet with diverse ViT backbones validate the effectiveness of our proposals which obtain significantly reduced computational cost and almost unimpaired generalization. |
| Researcher Affiliation | Collaboration | Tianlong Chen¹, Yu Cheng², Zhe Gan², Lu Yuan², Lei Zhang³, Zhangyang Wang¹ — ¹University of Texas at Austin, ²Microsoft Corporation, ³International Digital Economy Academy |
| Pseudocode | Yes | Algorithm 1: Sparse ViT Co-Exploration (SViTE+). Initialize: ViT model f_W, dataset D, sparsity distribution S = {s_1, ..., s_L}, update schedule {ΔT, T_end, α, f_decay}, learning rate η. 1: Initialize f_W with random sparsity S (highly reduced parameter count). 2: for each training iteration t do 3: Sample a batch b_t ~ D. 4: Score the input token embeddings and select the top-k informative tokens (token selection). 5: if (t mod ΔT == 0) and t < T_end then 6: for each layer l do 7: ρ = f_decay(t, α, T_end) · (1 − s_l) · N_l. 8: Perform prune-and-grow with portion ρ w.r.t. the chosen criterion, generating masks m_prune and m_grow to update f_W's sparsity patterns (connectivity exploration). 9: end for 10: else 11: W = W − η ∇_W L_t (weight update). 12: end if 13: end for 14: return a sparse ViT with a trained token selector. (A minimal sketch of the prune-and-grow step appears below the table.) |
| Open Source Code | Yes | Our codes are available at https://github.com/VITA-Group/SViTE. |
| Open Datasets | Yes | Our experiments are conducted on ImageNet with DeiT-Tiny/Small/Base backbones. |
| Dataset Splits | Yes | Our experiments are conducted on ImageNet with DeiT-Tiny/Small/Base backbones. The detailed training configurations are listed in Table 1, which mainly follows the default setups in [2]. |
| Hardware Specification | No | The paper mentions 'CUDA benchmark mode' but does not specify the exact hardware (GPU/CPU models, memory) used for the experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch-like style' and 'CUDA benchmark mode' but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Table 1: Details of training configurations in our experiments, mainly following the settings in [2]. Columns: Backbone; Update Schedule {ΔT, T_end, α, f_decay}; Batch Size; Epochs; Inherited Settings from DeiT [2]. Row DeiT-Tiny: update schedule {20000, 1200000, 0.5, cosine}; batch size 512; 600 epochs; inherited settings: AdamW, learning rate 0.0005 × batchsize/512, cosine decay, 5-epoch warmup, 0.05 weight decay, 0.1 label smoothing, augmentations, etc. |
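
The connectivity-exploration step of Algorithm 1 (lines 5-9) periodically prunes a portion ρ of each layer's active weights and regrows the same number of inactive connections, so the layer's sparsity level stays fixed while its connectivity pattern adapts. The sketch below illustrates that step under common dynamic-sparse-training assumptions (magnitude-based pruning, gradient-based growth, cosine f_decay); the function and variable names (`cosine_decay`, `prune_and_grow`, `rho`) are illustrative and not taken from the authors' released code.

```python
import math
import torch


def cosine_decay(t, alpha, t_end):
    """Assumed cosine schedule for f_decay(t, alpha, T_end) in Algorithm 1."""
    return alpha / 2.0 * (1.0 + math.cos(math.pi * t / t_end))


def prune_and_grow(weight, mask, grad, rho):
    """One sketched connectivity-exploration step for a single layer.

    Prunes the `rho` smallest-magnitude active weights and regrows `rho`
    currently inactive connections with the largest gradient magnitude,
    keeping the number of active parameters (and thus sparsity) unchanged.
    """
    rho = int(rho)
    if rho <= 0:
        return mask
    new_mask = mask.clone().flatten()
    # Prune: deactivate the weakest active connections.
    prune_scores = weight.abs().flatten().masked_fill(~new_mask.bool(), float("inf"))
    prune_idx = torch.topk(prune_scores, rho, largest=False).indices
    new_mask[prune_idx] = 0.0
    # Grow: activate the inactive connections with the largest gradients.
    grow_scores = grad.abs().flatten().masked_fill(new_mask.bool(), float("-inf"))
    grow_idx = torch.topk(grow_scores, rho, largest=True).indices
    new_mask[grow_idx] = 1.0
    return new_mask.view_as(mask)


# Hypothetical usage at one update step (t % delta_T == 0 and t < t_end),
# with `layer` and `sparsity` standing in for per-layer bookkeeping:
layer = torch.nn.Linear(64, 64)
sparsity = 0.8
mask = (torch.rand_like(layer.weight) > sparsity).float()
layer.weight.grad = torch.randn_like(layer.weight)  # stand-in gradient
rho = cosine_decay(t=20000, alpha=0.5, t_end=1200000) * (1 - sparsity) * layer.weight.numel()
mask = prune_and_grow(layer.weight.data, mask, layer.weight.grad, rho)
```

In a full training run this step would execute every ΔT iterations per layer with the Table 1 schedule values, while ordinary masked weight updates (Algorithm 1, line 11) proceed on all other iterations.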