Query-Memory Re-Aggregation for Weakly-supervised Video Object Segmentation

Authors: Fanchao Lin, Hongtao Xie, Yan Li, Yongdong Zhang (pp. 2038-2046)

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results on three benchmarks show that our method achieves the state-of-the-art performance in WVOS (e.g., an overall score of 84.7% on the DAVIS 2016 validation set).
Researcher Affiliation Collaboration 1 School of Information Science and Technology, University of Science and Technology of China, Hefei, China 2 Beijing Kuaishou Technology Co., Ltd., Beijing, China
Pseudocode No The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code No The paper does not provide concrete access to source code for the methodology described.
Open Datasets Yes We conduct experiments on three public datasets: the single-object DAVIS 2016 (Perazzi et al. 2016) dataset and the multi-object DAVIS 2017 (Pont-Tuset et al. 2017) and YouTube-VOS (Xu et al. 2018) datasets. For the pre-training on static data, we generate pairs of simulative video frames from the salient object segmentation datasets, i.e., DUTS (Wang et al. 2017), HKU-IS (Li and Yu 2015), MSRA (Cheng et al. 2014), and SOC (Fan et al. 2018).
Dataset Splits Yes DAVIS 2016 contains 50 videos, which are divided into a training set (30 videos) and a validation set (20 videos). DAVIS 2017 is an extended dataset of DAVIS 2016, which has 60 videos for training and 30 videos for validation with multiple targets per video. YouTube-VOS is a large-scale dataset consisting of 3471 training videos and 474 validation videos.
Hardware Specification Yes We train our network using the Adam algorithm with a fixed learning rate of 1e-5 on four GTX 1080Ti GPUs, and the batch size is 16. During the inference, only the initial bounding box label is given and the prediction is made in a propagation-like way. Our method is evaluated on a computer with a single V100 GPU.
Software Dependencies No The paper mentions deep learning frameworks and optimizers but does not provide specific version numbers for ancillary software dependencies.
Experiment Setup Yes We train our network using the Adam algorithm with a fixed learning rate of 1e-5 on four GTX 1080Ti GPUs, and the batch size is 16. For the encoders in our framework, both the query encoder fq and the memory encoder fm use ResNet50 (He et al. 2016) up to the 4th stage as the backbone, but fm adds extra filters in the input layer so that it can take 4 channels (RGB frame and a bounding box map) as input. The loss function of the whole framework is: L = l(B_q, B_l) + l(S_q, S_l) + l(S^r_q, S_l), where l is a combination of the dice loss and the cross-entropy loss, and the weight of dice loss is set to 0.1 by experience.
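The per-term loss l described above (cross-entropy plus dice, with the dice term weighted 0.1) can be sketched as follows. This is a hedged numpy illustration of the stated weighting, not the authors' implementation; the function names and the binary (single-map) formulation are assumptions.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice loss between a predicted probability map and a binary target.
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def ce_loss(pred, target, eps=1e-6):
    # Binary cross-entropy between a predicted probability map and a binary target.
    pred = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))

def combined_loss(pred, target, dice_weight=0.1):
    # l(., .) = cross-entropy + 0.1 * dice, per the weighting stated in the paper.
    return ce_loss(pred, target) + dice_weight * dice_loss(pred, target)
```

Under this sketch, the total loss L would sum `combined_loss` over the three prediction/label pairs (B_q vs. B_l, S_q vs. S_l, and S^r_q vs. S_l).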