SyncVIS: Synchronized Video Instance Segmentation

Authors: Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental evaluations are conducted on the challenging YouTube-VIS 2019 & 2021 & 2022, and OVIS benchmarks, and SyncVIS achieves state-of-the-art results, which demonstrates the effectiveness and generality of the proposed approach.
Researcher Affiliation | Collaboration | Rongkun Zheng¹, Lu Qi², Xi Chen¹, Yi Wang³, Kun Wang⁴, Yu Qiao³, Hengshuang Zhao¹; ¹The University of Hong Kong, ²University of California, Merced, ³Shanghai Artificial Intelligence Laboratory, ⁴SenseTime Research
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/rkzheng99/SyncVIS.
Open Datasets | Yes | We evaluate our SyncVIS on four popular VIS benchmarks, including YouTube-VIS 2019 & 2021 & 2022 [34], and OVIS-2021 [27]. ... Following the design of Mask2Former-VIS [6], we first trained our model on COCO [20] before training on VIS datasets. (See the annotation-loading sketch after this table.)
Dataset Splits | Yes | The validation set has an average length of 27.4 frames per video and covers 40 predefined categories. ... As a result, the validation videos' average length increased to 39.7 frames. The most recent update, YouTube-VIS 2022, adds an additional 71 long videos to the validation set and 89 extra long videos to the test set.
Hardware Specification | Yes | Most of our experiments are conducted on 4 A100 GPUs (80G).
Software Dependencies | Yes | Most of our experiments are conducted on 4 A100 GPUs (80G), and on a CUDA 11.1, PyTorch 3.9 environment. (See the environment check after this table.)
Experiment Setup | Yes | Hyper-parameters regarding the pixel and transformer decoder are the same as those of Mask2Former-VIS [6]. In the synchronized video-frame modeling, we set the number of frame-level and video-level embeddings N to 100. To extract the key information, we set N_k to 10. Following the design of Mask2Former-VIS [6], we first trained our model on COCO [20] before training on VIS datasets. We use the AdamW [23] optimizer with a base learning rate of 5e-4 on the Swin-Large backbone on YouTube-VIS 2019 (we use different training iterations and learning rates for different datasets). During inference, each frame's shorter side is resized to 360 pixels for ResNet [13] and 448 pixels for Swin [22]. (These values are summarized in the config sketch after this table.)
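
The Open Datasets and Dataset Splits rows above report per-video statistics for the YouTube-VIS splits. A minimal sketch for inspecting those statistics is given below; it assumes the standard COCO-style video annotation layout ("videos", "categories", and optionally "annotations" keys) that the YouTube-VIS downloads use, and the file path is illustrative rather than taken from the SyncVIS repository.

```python
import json
from collections import Counter

# Hypothetical path; the official YouTube-VIS releases name this file per split.
ANNOTATION_FILE = "youtube_vis_2019/valid/instances.json"

with open(ANNOTATION_FILE) as f:
    data = json.load(f)

# Each "videos" entry lists its per-frame file names, so the list length
# gives the number of frames in that video.
frame_counts = [len(video["file_names"]) for video in data["videos"]]
print(f"videos: {len(frame_counts)}, "
      f"mean length: {sum(frame_counts) / len(frame_counts):.1f} frames")

# "categories" holds the predefined classes (40 for YouTube-VIS 2019).
print(f"categories: {len(data['categories'])}")

# Annotations (one per instance track) may be absent from held-out splits.
per_category = Counter(ann["category_id"] for ann in data.get("annotations", []))
print(f"annotated instances: {sum(per_category.values())}")
```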
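
As a sanity check against the reported environment (4 A100 80G GPUs, CUDA 11.1), one can query the local PyTorch build before launching training. This is a generic snippet, not a script from the paper or its repository.

```python
import torch

# Report the local build; the paper describes a CUDA 11.1 environment.
print(f"PyTorch {torch.__version__}, built against CUDA {torch.version.cuda}")

# The reported setup uses 4 A100 (80G) GPUs.
print(f"visible GPUs: {torch.cuda.device_count()}")
for idx in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(idx)
    print(f"  [{idx}] {props.name}, {props.total_memory / 1024**3:.0f} GB")
```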
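
The Experiment Setup row condenses the paper's hyper-parameters; the sketch below gathers them into a flat config for readability. Only the numeric values are taken from the quoted text; the dictionary keys, function name, and structure are illustrative and do not mirror the SyncVIS or Mask2Former-VIS config files.

```python
# Illustrative hyper-parameter summary; key names are ours, values are the ones
# quoted above (Swin-Large on YouTube-VIS 2019). Training iterations and learning
# rates differ per dataset, so they are not fixed here.
SYNCVIS_HPARAMS = {
    "num_frame_level_embeddings": 100,  # N, frame-level query embeddings
    "num_video_level_embeddings": 100,  # N, video-level query embeddings
    "num_key_embeddings": 10,           # N_k, key-information embeddings
    "optimizer": "AdamW",
    "base_lr": 5e-4,                    # Swin-Large backbone, YouTube-VIS 2019
    "pretrain_dataset": "COCO",         # image pretraining before VIS fine-tuning
}

# Inference-time resizing of each frame's shorter side, per backbone family.
INFERENCE_SHORT_SIDE = {
    "resnet": 360,  # ResNet backbones
    "swin": 448,    # Swin backbones
}

def inference_short_side(backbone: str) -> int:
    """Return the reported shorter-side resize for a given backbone family."""
    key = "swin" if backbone.lower().startswith("swin") else "resnet"
    return INFERENCE_SHORT_SIDE[key]

print(inference_short_side("Swin-L"))  # 448
```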