Vision Transformer Adapter for Dense Predictions
Authors: Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation. Notably, without using extra detection data, our ViT-Adapter-L yields state-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-dev. We hope that the ViT-Adapter could serve as an alternative for vision-specific transformers and facilitate future research. We evaluate the ViT-Adapter on multiple challenging benchmarks, including COCO (Lin et al., 2014) and ADE20K (Zhou et al., 2017). As shown in Figure 2, our models consistently achieve improved performance compared to the prior arts under the fair pre-training strategy. |
| Researcher Affiliation | Collaboration | Zhe Chen (1,2), Yuchen Duan (2,3), Wenhai Wang (2), Junjun He (2), Tong Lu (1), Jifeng Dai (2,3), Yu Qiao (2); affiliations: 1 Nanjing University, 2 Shanghai AI Laboratory, 3 Tsinghua University; emails: czcz94cz@gmail.com, {duanyuchen,wangwenhai,hejunjun}@pjlab.org.cn, lutong@nju.edu.cn, {daijifeng,qiaoyu}@pjlab.org.cn |
| Pseudocode | No | The paper does not contain any pseudocode or explicitly labeled algorithm blocks. |
| Open Source Code | No | Code and models will be released at https://github.com/czczup/ViT-Adapter. |
| Open Datasets | Yes | Our detection experiments are based on MMDetection (Chen et al., 2019b) and the COCO (Lin et al., 2014) dataset. We evaluate our Vi T-Adapter on semantic segmentation with the ADE20K (Zhou et al., 2017) dataset and MMSegmentation (Contributors, 2020) codebase. |
| Dataset Splits | Yes | Figure 2: Object detection performance on COCO val2017 using Mask R-CNN. Table 1: Object detection and instance segmentation with Mask R-CNN on COCO val2017. |
| Hardware Specification | Yes | The per-iteration training time and GPU training memory are measured by A100 GPUs with per-GPU batch size 2 and FP16 training. |
| Software Dependencies | No | The paper mentions using MMDetection and MMSegmentation codebases but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Following common practices (Wang et al., 2021), we adopt a 1x or 3x training schedule (i.e., 12 or 36 epochs) with a batch size of 16, and the AdamW (Loshchilov & Hutter, 2017) optimizer with an initial learning rate of 1 × 10⁻⁴ and a weight decay of 0.05. We use a layer-wise learning rate decay of 0.9, and a drop path rate of 0.4. (See the optimizer sketch below.) |
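
The quoted setup maps naturally onto a PyTorch optimizer construction. The sketch below is a minimal illustration of AdamW with lr 1 × 10⁻⁴, weight decay 0.05, and layer-wise learning-rate decay 0.9; it assumes a ViT-style backbone whose transformer blocks are named `blocks.{i}`, and the helper `layerwise_lr_groups` and its parameter-name conventions are hypothetical, not taken from the authors' released configuration.

```python
# Hedged sketch of the reported optimizer settings: AdamW, base lr 1e-4,
# weight decay 0.05, layer-wise lr decay 0.9. Parameter naming is assumed,
# not the authors' exact code.
import torch
import torch.nn as nn


def layerwise_lr_groups(model: nn.Module, num_layers: int,
                        base_lr: float = 1e-4, decay: float = 0.9,
                        weight_decay: float = 0.05):
    """Build AdamW parameter groups where block i gets
    base_lr * decay ** (num_layers - layer_id), a common layer-wise
    decay scheme for ViT backbones."""
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Assumed naming convention: parameters of transformer block i
        # live under "blocks.i."; patch embedding gets the smallest lr,
        # everything else (e.g. adapter, head) the full lr.
        if 'blocks.' in name:
            layer_id = int(name.split('blocks.')[1].split('.')[0]) + 1
        elif 'patch_embed' in name:
            layer_id = 0
        else:
            layer_id = num_layers
        scale = decay ** (num_layers - layer_id)
        groups.append({'params': [param],
                       'lr': base_lr * scale,
                       'weight_decay': weight_decay})
    return groups


# Usage (assuming `model` is a ViT-style backbone with 12 blocks):
# optimizer = torch.optim.AdamW(layerwise_lr_groups(model, num_layers=12))
```

The 0.9 factor here mirrors the layer-wise learning-rate decay quoted in the row above; the drop path rate of 0.4 is a regularization setting on the backbone itself and would be configured on the model, not the optimizer.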