The Emergence of Objectness: Learning Zero-shot Segmentation from Videos

Authors: Runtao Liu, Zhirong Wu, Stella Yu, Stephen Lin

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our model demonstrates the surprising emergence of objectness in the appearance pathway, surpassing prior works on 1) zero-shot object segmentation from a single image, 2) moving object segmentation from a video with unsupervised test-time adaptation, and 3) semantic image segmentation with supervised fine-tuning. Our work is the first truly end-to-end learned zero-shot object segmentation model from unlabeled videos. It not only develops generic objectness for segmentation and tracking, but also outperforms image-based contrastive representation learning without augmentation engineering. [...] 4 Experiments. Tasks. We train our AMD model on unlabeled videos and test it on three downstream applications. 1) Zero-shot object segmentation. [...] 4.1 Zero-Shot Saliency Detection. [...] 4.2 Zero-shot Video Object Segmentation. [...] 4.3 Semantic Segmentation. [...] 4.4 Ablation Study.
Researcher Affiliation | Collaboration | Runtao Liu (Microsoft Research Asia; Johns Hopkins University), Zhirong Wu (Microsoft Research Asia), Stella X. Yu (UC Berkeley / ICSI), Stephen Lin (Microsoft Research Asia)
Pseudocode | No | The paper describes the model architecture and process through text and diagrams (Figure 2), but it does not include formal pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/rt219/The-Emergence-of-Objectness.
Open Datasets | Yes | Our training videos come from Youtube-VOS [66], a large object-centric video dataset. Its training split contains about 4,000 videos covering 94 categories of objects. The total duration of the dataset is 334 minutes. We sample video frames at 24 frames per second, without using any segmentation labels provided in Youtube-VOS. [66] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. Youtube-VOS: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018.
Dataset Splits | Yes | We finetune our AMD model on the PASCAL VOC training set and evaluate it on the validation set. The finetuning takes 40,000 iterations with batch size 16 and an initial learning rate of 0.01. The learning rate undergoes polynomial decay with a power parameter of 0.9. (A sketch of this schedule is given after the table.)
Hardware Specification | Yes | We train AMD on 8 V100 GPUs, with each processing two pairs of sampled adjacent frames.
Software Dependencies | No | The paper mentions using ResNet50 as a backbone and PWC-Net for the motion network, but it does not specify software versions for frameworks such as PyTorch or TensorFlow, or the programming languages used for implementation.
Experiment Setup | Yes | We resize the shorter edge of the input image to 400 pixels, and randomly crop a square image of size 384×384 with random horizontal flipping augmentation. No other augmentations are used. We adopt the symmetric reconstruction loss that considers either frame as the target frame and sums the two reconstruction errors. We use the Adam optimizer with a learning rate of 1e-4 and a weight decay of 1e-6. We train AMD on 8 V100 GPUs, with each processing two pairs of sampled adjacent frames. The network is optimized for 400K iterations. (A sketch of this pre-training setup follows the table.)
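
The following is a minimal sketch of the pre-training setup quoted in the Experiment Setup row. It is our reading of the described recipe, not the authors' released code: the names `augment_pair`, `frame_a`, `frame_b`, and the L1 photometric stand-in for the reconstruction error are illustrative assumptions, and the paper's actual reconstruction is driven by the predicted segment flow rather than a raw pixel difference.

```python
import random
import torch
from torchvision import transforms
from torchvision.transforms import functional as F

# Sketch of the described augmentation: resize the shorter edge to 400 px, take a
# shared 384x384 random crop, and apply a shared random horizontal flip so the two
# adjacent frames of a pair stay spatially aligned (function and argument names are ours).
def augment_pair(frame_a, frame_b, short_side=400, crop=384):
    resize = transforms.Resize(short_side)
    frame_a, frame_b = resize(frame_a), resize(frame_b)
    i, j, h, w = transforms.RandomCrop.get_params(frame_a, (crop, crop))
    frame_a, frame_b = F.crop(frame_a, i, j, h, w), F.crop(frame_b, i, j, h, w)
    if random.random() < 0.5:
        frame_a, frame_b = F.hflip(frame_a), F.hflip(frame_b)
    return F.to_tensor(frame_a), F.to_tensor(frame_b)

# Symmetric reconstruction loss: either frame serves as the target and the two
# errors are summed. An L1 photometric error is an assumed stand-in here; in the
# paper the reconstructions come from warping with the predicted segment flow.
def symmetric_reconstruction_loss(recon_a, frame_a, recon_b, frame_b):
    return (recon_a - frame_a).abs().mean() + (recon_b - frame_b).abs().mean()

# Optimizer settings quoted in the table (`model` is a placeholder for the AMD network):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-6)
```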
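
The Dataset Splits row describes fine-tuning on PASCAL VOC with polynomial learning-rate decay. Below is a minimal sketch of that schedule, assuming the common "poly" rule lr = base_lr * (1 - iter / max_iter) ** power used in semantic segmentation work; the function name and the training-loop placeholder are ours, not the paper's.

```python
# Polynomial decay quoted for PASCAL VOC fine-tuning, assuming the standard "poly" schedule.
def poly_lr(base_lr: float, cur_iter: int, max_iter: int, power: float = 0.9) -> float:
    """Decay base_lr toward 0 over max_iter iterations with the given power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# Values quoted in the table: 40,000 iterations, initial learning rate 0.01, power 0.9.
for it in range(40_000):
    lr = poly_lr(0.01, it, 40_000, power=0.9)
    # ... assign `lr` to the optimizer's parameter groups and run one training step ...
```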