Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning
Authors: Weicong Liang, Yuhui Yuan, Henghui Ding, Xiao Luo, Weihong Lin, Ding Jia, Zheng Zhang, Chao Zhang, Han Hu
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The results obtained by our method are promising on five dense prediction tasks, including object detection, semantic segmentation, panoptic segmentation, instance segmentation, and depth estimation. Accordingly, our method accelerates 40% FPS and saves 30% GFLOPs of Segmenter+ViT-L/16 while maintaining 99.5% of the performance on ADE20K without fine-tuning the official weights. |
| Researcher Affiliation | Collaboration | (1) Key Laboratory of Machine Perception (MOE), School of Intelligence Science and Technology, Peking University; (2) School of Mathematical Sciences, Peking University; (3) ETH Zurich; (4) Microsoft Research Asia |
| Pseudocode | No | The paper describes the token clustering and reconstruction layers using mathematical equations and textual explanations, but it does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | We will release the code soon. |
| Open Datasets | Yes | COCO [44]. This dataset consists of 123K images with 896K annotated bounding boxes belonging to 80 thing classes and 53 stuff classes, where the train set contains 118K images and the val set contains 5K images. ADE20K [87]...The train set contains 20,210 images...The val set contains 2,000 images. |
| Dataset Splits | Yes | COCO [44]. This dataset consists of 123K images with 896K annotated bounding boxes belonging to 80 thing classes and 53 stuff classes, where the train set contains 118K images and the val set contains 5K images. ADE20K [87]...The train set contains 20,210 images...The val set contains 2,000 images. PASCAL-Context [51]...the train set contains 4,996 images and the val set contains 5,104 images. Cityscapes [13]...The train set and val set contain 2,975 and 500 images respectively. KITTI [21]...around 26K images for train set and 698 images for val set, where only 653 images have the ground-truth depth maps... NYUv2 [58]...We report the depth prediction results of DPT [53] evaluated on 655 val images. |
| Hardware Specification | Yes | FPS is tested on a single V100 GPU with PyTorch 1.10 and CUDA 10.2 by default. |
| Software Dependencies | Yes | FPS is tested on a single V100 GPU with PyTorch 1.10 and CUDA 10.2 by default. |
| Experiment Setup | Yes | Hyper-parameters of token clustering/reconstruction layer. We first study the influence of the hyper-parameters associated with the token clustering layer, i.e., the number of neighboring pixels λ used in Equation 3, the number of EM iterations κ, and the choice of the temperature τ in Table 2. ... In summary, we choose λ as 5 × 5, κ as 5, and τ as 50 considering both performance and efficiency. Next, we also study the influence of the hyper-parameters within the token clustering layer, i.e., the number of nearest neighbors k within k-NN. We do not observe obvious differences and thus set k as 20. ... We choose α = 10, α + β = 24, and γ = 0 for all ablation experiments on ADE20K by default if not specified. |
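
For anyone trying to reproduce the ablation setting quoted in the Experiment Setup row, it may help to collect the reported defaults in one place. The following is only a sketch: the class name `AblationDefaults` and all field names are hypothetical and do not come from the authors' code, which was not released at the time of this summary. The value of λ is read here as a 5 × 5 window, which is an assumption about the quoted text; β is derived by subtraction from the two reported values α = 10 and α + β = 24.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AblationDefaults:
    """Hypothetical container for the defaults quoted in the table.

    Field names are illustrative only; the values are the ones reported
    for the ADE20K ablation experiments.
    """
    neighbor_window: Tuple[int, int] = (5, 5)  # lambda, read as a 5 x 5 window (assumption)
    em_iterations: int = 5                     # kappa
    temperature: float = 50.0                  # tau
    knn_k: int = 20                            # k in k-NN
    alpha: int = 10                            # alpha
    alpha_plus_beta: int = 24                  # alpha + beta
    gamma: int = 0                             # gamma

    @property
    def beta(self) -> int:
        # beta is not quoted directly; it follows from the two reported values.
        return self.alpha_plus_beta - self.alpha


defaults = AblationDefaults()
print(defaults.beta)  # 14
```

Keeping the settings in a frozen dataclass means a reproduction script cannot silently change a reported default at run time; any deviation has to be passed explicitly when the object is constructed.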