SyncVIS: Synchronized Video Instance Segmentation
Authors: Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental evaluations are conducted on the challenging YouTube-VIS 2019 & 2021 & 2022, and OVIS benchmarks, and SyncVIS achieves state-of-the-art results, which demonstrates the effectiveness and generality of the proposed approach. |
| Researcher Affiliation | Collaboration | Rongkun Zheng1, Lu Qi2, Xi Chen1, Yi Wang3, Kun Wang4, Yu Qiao3, Hengshuang Zhao1 — 1The University of Hong Kong, 2University of California, Merced, 3Shanghai Artificial Intelligence Laboratory, 4SenseTime Research |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/rkzheng99/SyncVIS. |
| Open Datasets | Yes | We evaluate our SyncVIS on four popular VIS benchmarks, including YouTube-VIS 2019 & 2021 & 2022 [34], and OVIS-2021 [27]. ... Following the design of Mask2Former-VIS [6], we first trained our model on COCO [20] before training on VIS datasets. |
| Dataset Splits | Yes | The validation set has an average length of 27.4 frames per video and covers 40 predefined categories. ... As a result, the validation videos' average length increased to 39.7 frames. The most recent update, YouTube-VIS 2022, adds an additional 71 long videos to the validation set and 89 extra long videos to the test set. |
| Hardware Specification | Yes | Most of our experiments are conducted on 4 A100 GPUs (80G) |
| Software Dependencies | Yes | Most of our experiments are conducted on 4 A100 GPUs (80G), and on a CUDA 11.1, PyTorch 3.9 environment. |
| Experiment Setup | Yes | Hyper-parameters regarding the pixel and transformer decoder are the same as those of Mask2Former-VIS [6]. In the synchronized video-frame modeling, we set the number of frame-level and video-level embeddings N to 100. To extract the key information, we set N_k to 10. Following the design of Mask2Former-VIS [6], we first trained our model on COCO [20] before training on VIS datasets. We use the AdamW [23] optimizer with a base learning rate of 5e-4 on the Swin-Large backbone for YouTube-VIS 2019 (different training iterations and learning rates are used for different datasets). During inference, each frame's shorter side is resized to 360 pixels for ResNet [13] and 448 pixels for Swin [22]. See the configuration sketch after this table. |
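
The quoted setup can be summarized in a small configuration sketch. This is a minimal illustration, not the authors' released code: the config keys, the placeholder model, and the way the optimizer is built are assumptions; only the numeric values (N = 100, N_k = 10, base learning rate 5e-4, inference short sides 360/448 px) come from the paper.

```python
# Minimal sketch of the reported training/inference hyper-parameters.
# The dummy model and config key names are illustrative assumptions;
# the numeric values are taken from the quoted experiment setup.
import torch
from torch import nn

config = {
    "num_embeddings": 100,      # frame-level and video-level embeddings N
    "num_key_embeddings": 10,   # N_k, key-information embeddings
    "base_lr": 5e-4,            # AdamW base learning rate (Swin-L, YouTube-VIS 2019)
    "short_side_resnet": 360,   # inference resize (shorter side) for ResNet backbone
    "short_side_swin": 448,     # inference resize (shorter side) for Swin backbone
}

# Placeholder module standing in for the SyncVIS model (assumption).
model = nn.Linear(256, config["num_embeddings"])

# AdamW optimizer as reported in the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=config["base_lr"])
print(optimizer)
```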