InsPro: Propagating Instance Query and Proposal for Online Video Instance Segmentation

Authors: Fei He, Haoyang Zhang, Naiyu Gao, Jian Jia, Yanhu Shan, Xin Zhao, Kaiqi Huang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To show the effectiveness of our method InsPro, we evaluate it on two popular VIS benchmarks, i.e., YouTube-VIS 2019 and YouTube-VIS 2021. Without bells and whistles, our InsPro with ResNet-50 backbone achieves 43.2 AP and 37.6 AP on these two benchmarks respectively, outperforming all other online VIS methods."
Researcher Affiliation | Collaboration | Fei He1,2, Haoyang Zhang4, Naiyu Gao4, Jian Jia1,2, Yanhu Shan4, Xin Zhao1,2, Kaiqi Huang1,2,3. 1 CRISE, Institute of Automation, Chinese Academy of Sciences; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences; 3 CAS Center for Excellence in Brain Science and Intelligence Technology; 4 Horizon Robotics. {hefei2018,jiajian2018}@ia.ac.cn, {haoyang.zhang,naiyu01.gao,yanhu.shan}@horizon.ai, {xzhao,kaiqi.huang}@nlpr.ia.ac.cn
Pseudocode | No | The paper describes the method using prose and diagrams (Figure 1, Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | "We list the data and instructions in Sec. 4 and Appendix, and the code will be released upon publication."
Open Datasets | Yes | "We evaluate our method on YouTube-VIS 2019 and 2021 benchmarks [1]."
Dataset Splits | Yes | "YouTube-VIS 2019 consists of 2,238 training videos and 302 validation videos, and labels 40 object categories. YouTube-VIS 2021 is an extended version, which comprises 2,985 training videos and 421 validation videos, with an improved set of 40 categories."
Hardware Specification | Yes | "Moreover, our lite variant, InsPro-lite, reaches 38.7 AP on YouTube-VIS 2019 at an impressive 45.7 FPS on a Nvidia RTX2080Ti GPU." "The training is performed end-to-end on 8 Nvidia RTX2080Ti GPUs."
Software Dependencies | No | "We implement our InsPro with Detectron2 [49], and most hyperparameters are set following Sparse R-CNN [12] and CondInst [39] unless otherwise specified."
Experiment Setup | Yes | "We employ AdamW [50] with an initial learning rate of 2.5 × 10⁻⁵ and weight decay 0.0001 as our model optimizer. We initialize our model with parameters pre-trained on COCO [51], and train it for 32k iterations where the learning rate is divided by 10 at iterations 24k and 28k, respectively. The training is performed end-to-end on 8 Nvidia RTX2080Ti GPUs and each GPU holds one mini-batch which contains three frame images randomly sampled from the same video. Data augmentation includes only random horizontal flip and multi-scale training where the training image is resized so that the length of its shortest side is at least 288 and at most 512. Unless otherwise noted, our InsPro adopts ResNet-50 [13] as the backbone and uses 100 instance queries in our experiments."
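As a rough self-check on the quoted schedule (initial learning rate 2.5 × 10⁻⁵, divided by 10 at iterations 24k and 28k over a 32k-iteration run), the step decay can be sketched in plain Python. The function name and its defaults are illustrative, not taken from the authors' code, which has not been released.

```python
def lr_at(iteration, base_lr=2.5e-5, milestones=(24_000, 28_000), gamma=0.1):
    """Step-decay schedule: multiply the base learning rate by gamma
    (here 1/10) once for every milestone the iteration has passed."""
    drops = sum(iteration >= m for m in milestones)
    return base_lr * gamma ** drops

# Learning rate at a few points in the 32k-iteration run described above.
schedule = {it: lr_at(it) for it in (0, 23_999, 24_000, 27_999, 28_000, 31_999)}
```

Under this reading, the final 4k iterations run at 2.5 × 10⁻⁷, one hundredth of the initial rate.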