VITA: Video Instance Segmentation via Object Token Association
Authors: Miran Heo, Sukjun Hwang, Seoung Wug Oh, Joon-Young Lee, Seon Joo Kim
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate VITA on three popular VIS benchmarks, YouTube-VIS 2019 & 2021 [32] and OVIS [24]. With a ResNet-50 [14] backbone, VITA achieves new state-of-the-art results of 49.8 AP & 45.7 AP on YouTube-VIS 2019 & 2021, and 19.6 AP on OVIS. Above all, VITA outperforms the previous best approaches by 5.1 AP on YouTube-VIS 2021, which contains more complicated and longer sequences than YouTube-VIS 2019. |
| Researcher Affiliation | Collaboration | Miran Heo (Yonsei University), Sukjun Hwang (Yonsei University), Seoung Wug Oh (Adobe Research), Joon-Young Lee (Adobe Research), Seon Joo Kim (Yonsei University) |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code is available at https://github.com/sukjunhwang/VITA. |
| Open Datasets | Yes | We evaluate VITA on three popular VIS benchmarks, YouTube-VIS 2019 & 2021 [32] and OVIS [24]. ... we first train our model on the COCO [20] dataset following Mask2Former. Then, we train our method on the VIS datasets [32, 24] simultaneously with pseudo videos generated from images [20] following the details of SeqFormer [30]. |
| Dataset Splits | Yes | YouTube-VIS 2021. In order to address more difficult scenarios, additional videos are included in YouTube-VIS 2021 (794 videos for training and 129 videos for validation). ... OVIS. ... Finally, the average length of videos for the valid set is 62.7 frames (the longest video has 292 frames), which is much longer than that of YouTube-VIS. |
| Hardware Specification | Yes | VITA is the first offline method that presents the results on OVIS benchmark that consists of long videos (the longest video has 292 frames) using a single 12GB GPU. ... To take into account the general environment, all results are computed using a single 12GB Titan XP GPU. |
| Software Dependencies | No | Our method is implemented on top of detectron2 [31]. No specific version numbers for software dependencies are provided. |
| Experiment Setup | Yes | All hyper-parameters regarding the frame-level detector are equal to the defaults of Mask2Former [7]. The total loss Ltotal is balanced with λv, λf, and λsim set to 1.0, 1.0, and 0.5, respectively. By default, Object Encoder is composed of three layers with the window size W = 6, and Object Decoder employs six layers with Nv = 100 video queries. Having VITA built on top of Mask2Former, we first train our model on the COCO [20] dataset following Mask2Former. Then, we train our method on the VIS datasets [32, 24] simultaneously with pseudo videos generated from images [20] following the details of SeqFormer [30]. During inference, each frame is resized to a shorter edge size of 360 and 448 pixels when using ResNet [14] and Swin [22] backbones, respectively. |
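
For quick reference, the sketch below collects the hyper-parameters quoted in the Experiment Setup row and shows one way the weighted total loss could be assembled. Only the numeric values (λv = 1.0, λf = 1.0, λsim = 0.5, W = 6, Nv = 100, resize edges 360/448) come from the paper; the loss-term names, function, and dictionary keys are illustrative assumptions, not the authors' exact implementation.

```python
import torch

# Loss weights quoted from the paper: lambda_v, lambda_f, lambda_sim.
LAMBDA_V, LAMBDA_F, LAMBDA_SIM = 1.0, 1.0, 0.5

def total_loss(loss_video: torch.Tensor,
               loss_frame: torch.Tensor,
               loss_sim: torch.Tensor) -> torch.Tensor:
    """Weighted sum Ltotal = λv·Lv + λf·Lf + λsim·Lsim (term names are assumed)."""
    return LAMBDA_V * loss_video + LAMBDA_F * loss_frame + LAMBDA_SIM * loss_sim

# Remaining settings quoted from the setup, collected as a plain dict for reference.
VITA_CFG = {
    "object_encoder_layers": 3,   # Object Encoder: three layers
    "window_size": 6,             # window size W = 6
    "object_decoder_layers": 6,   # Object Decoder: six layers
    "num_video_queries": 100,     # Nv = 100 video queries
    "shorter_edge_resnet": 360,   # inference resize with ResNet backbone
    "shorter_edge_swin": 448,     # inference resize with Swin backbone
}
```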