Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning

Authors: Weicong Liang, Yuhui Yuan, Henghui Ding, Xiao Luo, Weihong Lin, Ding Jia, Zheng Zhang, Chao Zhang, Han Hu

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The results obtained by our method are promising on five dense prediction tasks, including object detection, semantic segmentation, panoptic segmentation, instance segmentation, and depth estimation. Accordingly, our method accelerates Segmenter+ViT-L/16 by 40% FPS and saves 30% GFLOPs while maintaining 99.5% of the performance on ADE20K without fine-tuning the official weights.
Researcher Affiliation | Collaboration | 1 Key Laboratory of Machine Perception (MOE), School of Intelligence Science and Technology, Peking University; 2 School of Mathematical Sciences, Peking University; 3 ETH Zurich; 4 Microsoft Research Asia
Pseudocode | No | The paper describes the token clustering and reconstruction layers using mathematical equations and textual explanations, but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | No | We will release the code soon.
Open Datasets | Yes | COCO [44]. This dataset consists of 123K images with 896K annotated bounding boxes belonging to 80 thing classes and 53 stuff classes, where the train set contains 118K images and the val set contains 5K images. ADE20K [87]... The train set contains 20,210 images... The val set contains 2,000 images.
Dataset Splits | Yes | COCO [44]. This dataset consists of 123K images with 896K annotated bounding boxes belonging to 80 thing classes and 53 stuff classes, where the train set contains 118K images and the val set contains 5K images. ADE20K [87]... The train set contains 20,210 images... The val set contains 2,000 images. PASCAL-Context [51]... the train set contains 4,996 images and the val set contains 5,104 images. Cityscapes [13]... The train set and val set contain 2,975 and 500 images, respectively. KITTI [21]... around 26K images for the train set and 698 images for the val set, where only 653 images have the ground-truth depth maps... NYUv2 [58]... We report the depth prediction results of DPT [53] evaluated on 655 val images.
Hardware Specification | Yes | FPS is tested on a single V100 GPU with PyTorch 1.10 and CUDA 10.2 by default.
Software Dependencies | Yes | FPS is tested on a single V100 GPU with PyTorch 1.10 and CUDA 10.2 by default. (A generic FPS-timing sketch is given below the table.)
Experiment Setup | Yes | Hyper-parameters of the token clustering/reconstruction layers. We first study the influence of the hyper-parameters associated with the token clustering layer, i.e., the number of neighboring pixels λ used in Equation 3, the number of EM iterations κ, and the choice of the temperature τ in Table 2. ... In summary, we choose λ as 5 × 5, κ as 5, and τ as 50 considering both performance and efficiency. Next, we also study the influence of the hyper-parameter within the token reconstruction layer, i.e., the number of nearest neighbors k within k-NN. We do not observe obvious differences and thus set k as 20. ... We choose α = 10, α + β = 24, and γ = 0 for all ablation experiments on ADE20K by default if not specified. (A hedged sketch of these layers follows the table.)
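
The Experiment Setup row above lists the hyper-parameters of the token clustering and reconstruction layers (neighborhood size λ, EM iterations κ, temperature τ, and k for k-NN). The sketch below is a minimal, single-image PyTorch illustration of temperature-scaled EM clustering followed by k-NN weighted reconstruction, written under assumptions: the function names, the uniform-stride center initialization, and the global assignment (rather than the paper's local λ × λ neighborhood restriction on a 2D token grid) are simplifications, not the authors' released implementation.

# Hedged sketch: EM-style token clustering and k-NN token reconstruction.
# Global assignment and uniform-stride initialization are simplifying assumptions.
import torch
import torch.nn.functional as F

def cluster_tokens(x, num_clusters, kappa=5, tau=50.0):
    """x: (N, C) high-resolution tokens -> (M, C) cluster tokens, (N, M) assignments."""
    N, C = x.shape
    # Initialize cluster centers by uniformly striding over the token sequence.
    idx = torch.linspace(0, N - 1, num_clusters).long()
    centers = x[idx].clone()
    for _ in range(kappa):
        # E-step: soft assignment of every token to every center with temperature tau
        # (the paper restricts this to a local lambda x lambda neighborhood; omitted here).
        sim = F.normalize(x, dim=-1) @ F.normalize(centers, dim=-1).t()   # (N, M)
        assign = (tau * sim).softmax(dim=-1)
        # M-step: update centers as assignment-weighted averages of the tokens.
        centers = (assign.t() @ x) / (assign.sum(dim=0).unsqueeze(-1) + 1e-6)
    return centers, assign

def reconstruct_tokens(x_hr, z_lr, z_lr_refined, k=20, tau=50.0):
    """x_hr: (N, C) tokens before clustering; z_lr: (M, C) cluster tokens right after
    clustering; z_lr_refined: (M, C) cluster tokens after the intermediate transformer
    layers. Returns (N, C) reconstructed high-resolution tokens."""
    # Similarity between the original tokens and the unrefined cluster tokens.
    sim = F.normalize(x_hr, dim=-1) @ F.normalize(z_lr, dim=-1).t()       # (N, M)
    topk_sim, topk_idx = sim.topk(k, dim=-1)                              # (N, k)
    weights = (tau * topk_sim).softmax(dim=-1)                            # (N, k)
    neighbors = z_lr_refined[topk_idx]                                    # (N, k, C)
    return (weights.unsqueeze(-1) * neighbors).sum(dim=1)                 # (N, C)

In this reading, the cluster tokens returned by cluster_tokens would presumably be processed by the intermediate transformer layers before being passed back to reconstruct_tokens as z_lr_refined; the exact insertion points governed by α, β, and γ are specified in the paper rather than here.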
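
The FPS figures referenced in the Hardware Specification and Software Dependencies rows are reported for a single V100 with PyTorch 1.10 and CUDA 10.2, but the quoted excerpt does not spell out the timing protocol. The snippet below is only a generic way to measure FPS in PyTorch (warm-up iterations plus CUDA synchronization around the timed loop); the model, input resolution, batch size, and iteration counts are placeholders, not the paper's exact benchmark setup.

# Generic FPS-measurement sketch; placeholders only, not the paper's protocol.
import time
import torch

@torch.no_grad()
def measure_fps(model, input_shape=(1, 3, 640, 640), warmup=10, iters=50):
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):
        model(x)                      # warm-up passes excluded from timing
    torch.cuda.synchronize()          # make sure warm-up work has finished
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()          # wait for all queued GPU work before stopping the clock
    return iters * input_shape[0] / (time.time() - start)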