Video Object Segmentation in Panoptic Wild Scenes
Authors: Yuanyou Xu, Zongxin Yang, Yi Yang
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that VIPOSeg can not only boost the performance of VOS models by panoptic training but also evaluate them comprehensively in panoptic scenes. Previous methods for classic VOS still need to improve in performance and efficiency when dealing with panoptic scenes, while our PAOT achieves SOTA performance with good efficiency on VIPOSeg and previous VOS benchmarks. |
| Researcher Affiliation | Collaboration | Yuanyou Xu¹,², Zongxin Yang¹, Yi Yang¹. ¹ReLER, CCAI, Zhejiang University; ²Baidu Research. {yoxu, zongxinyang, yangyics}@zju.edu.cn |
| Pseudocode | No | The paper includes architectural diagrams (Figures 4 and 5) but no formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our dataset and code are available at https://github.com/yoxu515/VIPOSeg-Benchmark. |
| Open Datasets | Yes | The datasets for training include DAVIS 2017 (D) [Pont-Tuset et al., 2017], YouTube-VOS 2019 (Y) [Xu et al., 2018] and our VIPOSeg (V). |
| Dataset Splits | Yes | In terms of VIPSeg, the 3,536 videos are split into 2,806/343/387 for training, validation and test. We only use the training and validation sets in our VIPOSeg (3,149 videos in total) because the annotations for the test set are private. In order to add unseen classes to the validation set, we re-split the videos into new training and validation sets. (A sketch of such a re-split appears after the table.) |
| Hardware Specification | Yes | During training, we use 4 Nvidia Tesla A100 GPUs, and the batch size is 16. ... The measure of FPS and memory is on Nvidia Tesla A100 GPU. |
| Software Dependencies | No | The paper mentions specific model architectures (e.g., ResNet-50, Swin Transformer-Base) but does not provide specific version numbers for software dependencies like PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | The encoder backbones of PAOT models are chosen from ResNet-50 [He et al., 2016] and Swin Transformer-Base [Liu et al., 2021]. As for multi-scale object matching, we set E-LSTT at four scales (1/16, 1/16, 1/8, 1/4) to 2, 1, 1, 0 layers respectively (4 layers in total). ... During training, we use 4 Nvidia Tesla A100 GPUs, and the batch size is 16. For pre-training, we use an initial learning rate of 4 × 10⁻⁴ for 100,000 steps. For main training, the initial learning rate is 2 × 10⁻⁴, and the training steps are 100,000. The learning rate gradually decays to 1 × 10⁻⁵ in a polynomial manner [Yang et al., 2020]. (A sketch of this schedule appears after the table.) |
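The dataset-splits row notes that the authors re-split the videos so that some classes appear only in validation. As a minimal sketch of how a class-level holdout like this can be done (the function name, input format, and assignment rule here are assumptions for illustration, not the authors' released code):

```python
def resplit_by_unseen_classes(videos: dict[str, set[str]],
                              unseen_classes: set[str]) -> tuple[list[str], list[str]]:
    """Send any video containing a held-out ('unseen') class to validation,
    so those classes never appear in training.

    `videos` maps video id -> set of annotated class labels. This is a
    hypothetical reconstruction of the re-split described in the paper.
    """
    train, val = [], []
    for vid, classes in videos.items():
        (val if classes & unseen_classes else train).append(vid)
    return train, val
```

Under this rule the resulting train/validation counts would differ from the original 2,806/343 VIPSeg split, which is consistent with the paper's statement that a new split was produced.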
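The experiment-setup row describes a polynomial learning-rate schedule decaying from 2 × 10⁻⁴ to 1 × 10⁻⁵ over 100,000 steps [Yang et al., 2020]. A minimal sketch of such a schedule, assuming a power of 0.9 (the quoted text says only "in a polynomial manner" and does not state the exponent):

```python
def polynomial_lr(step: int,
                  base_lr: float = 2e-4,
                  end_lr: float = 1e-5,
                  total_steps: int = 100_000,
                  power: float = 0.9) -> float:
    """Polynomial decay from base_lr to end_lr over total_steps.

    power=0.9 is a common default but is an assumption here, not a
    value given in the paper.
    """
    step = min(step, total_steps)
    return (base_lr - end_lr) * (1.0 - step / total_steps) ** power + end_lr
```

As a sanity check, `polynomial_lr(0)` returns 2e-4 and `polynomial_lr(100_000)` returns 1e-5, matching the endpoints quoted in the table.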