VITA: Video Instance Segmentation via Object Token Association
Authors: Miran Heo, Sukjun Hwang, Seoung Wug Oh, Joon-Young Lee, Seon Joo Kim
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate VITA on three popular VIS benchmarks, YouTube-VIS 2019 & 2021 [32] and OVIS [24]. With a ResNet-50 [14] backbone, VITA achieves new state-of-the-art results of 49.8 AP & 45.7 AP on YouTube-VIS 2019 & 2021, and 19.6 AP on OVIS. Above all, VITA outperforms the previous best approaches by 5.1 AP on YouTube-VIS 2021, which contains more complicated and longer sequences than YouTube-VIS 2019. |
| Researcher Affiliation | Collaboration | Miran Heo (Yonsei University), Sukjun Hwang (Yonsei University), Seoung Wug Oh (Adobe Research), Joon-Young Lee (Adobe Research), Seon Joo Kim (Yonsei University) |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code is available at https://github.com/sukjunhwang/VITA. |
| Open Datasets | Yes | We evaluate VITA on three popular VIS benchmarks, YouTube-VIS 2019 & 2021 [32] and OVIS [24]. ... we first train our model on the COCO [20] dataset following Mask2Former. Then, we train our method on the VIS datasets [32, 24] simultaneously with pseudo videos generated from images [20] following the details of SeqFormer [30]. |
| Dataset Splits | Yes | YouTube-VIS 2021. In order to address more difficult scenarios, additional videos are included in YouTube-VIS 2021 (794 videos for training and 129 videos for validation). ... OVIS. ... Finally, the average length of videos for the valid set is 62.7 frames (the longest video has 292 frames), which is much longer than that of YouTube-VIS. |
| Hardware Specification | Yes | VITA is the first offline method that presents the results on OVIS benchmark that consists of long videos (the longest video has 292 frames) using a single 12GB GPU. ... To take into account the general environment, all results are computed using a single 12GB Titan XP GPU. |
| Software Dependencies | No | Our method is implemented on top of detectron2 [31]. No specific version numbers for software dependencies are provided. |
| Experiment Setup | Yes | All hyper-parameters regarding the frame-level detector are equal to the defaults of Mask2Former [7]. The total loss Ltotal is balanced with λv, λf, and λsim set to 1.0, 1.0, and 0.5, respectively. By default, Object Encoder is composed of three layers with the window size W = 6, and Object Decoder employs six layers with Nv = 100 video queries. Having VITA built on top of Mask2Former, we first train our model on the COCO [20] dataset following Mask2Former. Then, we train our method on the VIS datasets [32, 24] simultaneously with pseudo videos generated from images [20] following the details of SeqFormer [30]. During inference, each frame is resized to a shorter edge size of 360 and 448 pixels when using ResNet [14] and Swin [22] backbones, respectively. |
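
For quick reference, the sketch below collects the hyper-parameters quoted in the Experiment Setup row and shows one way the weighted total loss could be assembled. Only the numeric values (λv = 1.0, λf = 1.0, λsim = 0.5, W = 6, Nv = 100, resize edges 360/448) come from the paper; the loss-term names, function, and dictionary keys are illustrative assumptions, not the authors' exact implementation.

```python
import torch

# Loss weights quoted from the paper: lambda_v, lambda_f, lambda_sim.
LAMBDA_V, LAMBDA_F, LAMBDA_SIM = 1.0, 1.0, 0.5

def total_loss(loss_video: torch.Tensor,
               loss_frame: torch.Tensor,
               loss_sim: torch.Tensor) -> torch.Tensor:
    """Weighted sum Ltotal = λv·Lv + λf·Lf + λsim·Lsim (term names are assumed)."""
    return LAMBDA_V * loss_video + LAMBDA_F * loss_frame + LAMBDA_SIM * loss_sim

# Remaining settings quoted from the setup, collected as a plain dict for reference.
VITA_CFG = {
    "object_encoder_layers": 3,   # Object Encoder: three layers
    "window_size": 6,             # window size W = 6
    "object_decoder_layers": 6,   # Object Decoder: six layers
    "num_video_queries": 100,     # Nv = 100 video queries
    "shorter_edge_resnet": 360,   # inference resize with ResNet backbone
    "shorter_edge_swin": 448,     # inference resize with Swin backbone
}
```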