Video Object Segmentation in Panoptic Wild Scenes

Authors: Yuanyou Xu, Zongxin Yang, Yi Yang

IJCAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that VIPOSeg can not only boost the performance of VOS models by panoptic training but also evaluate them comprehensively in panoptic scenes. Previous methods for classic VOS still need to improve in performance and efficiency when dealing with panoptic scenes, while our PAOT achieves SOTA performance with good efficiency on VIPOSeg and previous VOS benchmarks.
Researcher Affiliation | Collaboration | Yuanyou Xu (1,2), Zongxin Yang (1), Yi Yang (1); (1) ReLER, CCAI, Zhejiang University; (2) Baidu Research; {yoxu, zongxinyang, yangyics}@zju.edu.cn
Pseudocode | No | The paper includes architectural diagrams (Figures 4 and 5) but no formal pseudocode or algorithm blocks.
Open Source Code | Yes | Our dataset and code are available at https://github.com/yoxu515/VIPOSeg-Benchmark.
Open Datasets | Yes | The datasets for training include DAVIS 2017 (D) [Pont-Tuset et al., 2017], YouTube-VOS 2019 (Y) [Xu et al., 2018] and our VIPOSeg (V).
Dataset Splits | Yes | In terms of VIPSeg, the 3,536 videos are split into 2,806/343/387 for training, validation and test. We only use the training and validation sets in our VIPOSeg (3,149 videos in total) because the annotations for the test set are private. In order to add unseen classes to the validation set, we re-split the videos into new training and validation sets.
Hardware Specification | Yes | During training, we use 4 Nvidia Tesla A100 GPUs, and the batch size is 16. ... FPS and memory are measured on an Nvidia Tesla A100 GPU.
Software Dependencies | No | The paper mentions specific model architectures (e.g., ResNet-50, Swin Transformer-Base) but does not provide version numbers for software dependencies such as PyTorch, TensorFlow, or CUDA.
Experiment Setup | Yes | The encoder backbones of PAOT models are chosen from ResNet-50 [He et al., 2016] and Swin Transformer-Base [Liu et al., 2021]. For multi-scale object matching, we set the E-LSTT blocks at the four scales 1/16, 1/16, 1/8 and 1/4 to 2, 1, 1 and 0 layers respectively (4 layers in total). ... During training, we use 4 Nvidia Tesla A100 GPUs, and the batch size is 16. For pre-training, we use an initial learning rate of 4 × 10^-4 for 100,000 steps. For main training, the initial learning rate is 2 × 10^-4, and the training steps are 100,000. The learning rate gradually decays to 1 × 10^-5 in a polynomial manner [Yang et al., 2020].
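
The polynomial learning-rate decay quoted in the Experiment Setup row can be made concrete with a short sketch. The snippet below is a minimal, hypothetical illustration assuming the common formulation lr(t) = lr_end + (lr_start - lr_end) * (1 - t/T)^p, with the main-training values from the quote (2 × 10^-4 decaying to 1 × 10^-5 over 100,000 steps); the decay power p = 0.9 is an assumption and is not stated in the excerpt.

    # Minimal sketch of a polynomial learning-rate schedule (assumed form:
    # lr(t) = lr_end + (lr_start - lr_end) * (1 - t / T) ** p).
    # lr_start, lr_end and total_steps follow the quoted main-training setup;
    # power = 0.9 is a hypothetical choice, not taken from the paper.
    def polynomial_lr(step, total_steps=100_000, lr_start=2e-4, lr_end=1e-5, power=0.9):
        progress = min(step, total_steps) / total_steps
        return lr_end + (lr_start - lr_end) * (1.0 - progress) ** power

    # Example: learning rate at a few points of main training.
    for s in (0, 25_000, 50_000, 75_000, 100_000):
        print(f"step {s:>7}: lr = {polynomial_lr(s):.2e}")

At step 0 this returns 2.00e-04 and at step 100,000 it reaches exactly 1.00e-05, matching the endpoints quoted above; intermediate values depend on the assumed power.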