Video Object Segmentation in Panoptic Wild Scenes

Authors: Yuanyou Xu, Zongxin Yang, Yi Yang

IJCAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that VIPOSeg can not only boost the performance of VOS models by panoptic training but also evaluate them comprehensively in panoptic scenes. Previous methods for classic VOS still need to improve in performance and efficiency when dealing with panoptic scenes, while our PAOT achieves SOTA performance with good efficiency on VIPOSeg and previous VOS benchmarks.
Researcher Affiliation | Collaboration | Yuanyou Xu (1,2), Zongxin Yang (1), Yi Yang (1); (1) ReLER, CCAI, Zhejiang University; (2) Baidu Research; {yoxu, zongxinyang, yangyics}@zju.edu.cn
Pseudocode | No | The paper includes architectural diagrams (Figures 4 and 5) but no formal pseudocode or algorithm blocks.
Open Source Code | Yes | Our dataset and code are available at https://github.com/yoxu515/VIPOSeg-Benchmark.
Open Datasets | Yes | The datasets for training include DAVIS 2017 (D) [Pont-Tuset et al., 2017], YouTube-VOS 2019 (Y) [Xu et al., 2018] and our VIPOSeg (V).
Dataset Splits | Yes | In terms of VIPSeg, the 3,536 videos are split into 2,806/343/387 for training, validation and test. We only use the training and validation sets in our VIPOSeg (3,149 videos in total) because the annotations for the test set are private. In order to add unseen classes to the validation set, we re-split the videos into new training and validation sets.
Hardware Specification | Yes | During training, we use 4 Nvidia Tesla A100 GPUs, and the batch size is 16. ... FPS and memory are measured on an Nvidia Tesla A100 GPU.
Software Dependencies | No | The paper mentions specific model architectures (e.g., ResNet-50, Swin Transformer-Base) but does not provide version numbers for software dependencies such as PyTorch, TensorFlow, or CUDA.
Experiment Setup | Yes | The encoder backbones of PAOT models are chosen from ResNet-50 [He et al., 2016] and Swin Transformer-Base [Liu et al., 2021]. For multi-scale object matching, we set the E-LSTT blocks at the four scales 1/16, 1/16, 1/8 and 1/4 to 2, 1, 1 and 0 layers respectively (4 layers in total). ... During training, we use 4 Nvidia Tesla A100 GPUs, and the batch size is 16. For pre-training, we use an initial learning rate of 4 × 10^-4 for 100,000 steps. For main training, the initial learning rate is 2 × 10^-4, and the training steps are 100,000. The learning rate gradually decays to 1 × 10^-5 in a polynomial manner [Yang et al., 2020].
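
The polynomial learning-rate decay quoted in the Experiment Setup row can be made concrete with a short sketch. The snippet below is a minimal, hypothetical illustration assuming the common formulation lr(t) = lr_end + (lr_start - lr_end) * (1 - t/T)^p, with the main-training values from the quote (2 × 10^-4 decaying to 1 × 10^-5 over 100,000 steps); the decay power p = 0.9 is an assumption and is not stated in the excerpt.

    # Minimal sketch of a polynomial learning-rate schedule (assumed form:
    # lr(t) = lr_end + (lr_start - lr_end) * (1 - t / T) ** p).
    # lr_start, lr_end and total_steps follow the quoted main-training setup;
    # power = 0.9 is a hypothetical choice, not taken from the paper.
    def polynomial_lr(step, total_steps=100_000, lr_start=2e-4, lr_end=1e-5, power=0.9):
        progress = min(step, total_steps) / total_steps
        return lr_end + (lr_start - lr_end) * (1.0 - progress) ** power

    # Example: learning rate at a few points of main training.
    for s in (0, 25_000, 50_000, 75_000, 100_000):
        print(f"step {s:>7}: lr = {polynomial_lr(s):.2e}")

At step 0 this returns 2.00e-04 and at step 100,000 it reaches exactly 1.00e-05, matching the endpoints quoted above; intermediate values depend on the assumed power.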