The Emergence of Objectness: Learning Zero-shot Segmentation from Videos
Authors: Runtao Liu, Zhirong Wu, Stella Yu, Stephen Lin
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our model demonstrates the surprising emergence of objectness in the appearance pathway, surpassing prior works on 1) zero-shot object segmentation from a single image, 2) moving object segmentation from a video with unsupervised test-time adaptation, and 3) semantic image segmentation with supervised fine-tuning. Our work is the first truly end-to-end learned zero-shot object segmentation model from unlabeled videos. It not only develops generic objectness for segmentation and tracking, but also outperforms image-based contrastive representation learning without augmentation engineering. [...] 4 Experiments. Tasks. We train our AMD model on unlabeled videos and test it on three downstream applications. 1) Zero-shot object segmentation. [...] 4.1 Zero-Shot Saliency Detection. [...] 4.2 Zero-shot Video Object Segmentation. [...] 4.3 Semantic Segmentation. [...] 4.4 Ablation Study. |
| Researcher Affiliation | Collaboration | Runtao Liu¹,² Zhirong Wu¹ Stella X. Yu³ Stephen Lin¹ (¹Microsoft Research Asia, ²Johns Hopkins University, ³UC Berkeley / ICSI) |
| Pseudocode | No | The paper describes the model architecture and process through text and diagrams (Figure 2), but it does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/rt219/The-Emergence-of-Objectness. |
| Open Datasets | Yes | Our training videos come from Youtube-VOS [66], a large object-centric video dataset. Its training split contains about 4,000 videos covering 94 categories of objects. The total duration of the dataset is 334 minutes. We sample video frames at 24 frames per second, without using any segmentation labels provided in Youtube-VOS. [66] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. Youtube-vos: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018. |
| Dataset Splits | Yes | We finetune our AMD model on the PASCAL VOC training set and evaluate it on the validation set. The finetuning takes 40,000 iterations with batch size 16 and the initial learning rate 0.01. The learning rate undergoes polynomial decay with a power parameter of 0.9. |
| Hardware Specification | Yes | We train AMD on 8 V100 GPUs, with each processing two pairs of sampled adjacent frames. |
| Software Dependencies | No | The paper mentions using ResNet50 as a backbone and PWC-Net for the motion network, but it does not specify any software versions for frameworks like PyTorch, TensorFlow, or programming languages used for implementation. |
| Experiment Setup | Yes | We resize the shorter edge of the input image to 400 pixels, and randomly crop a square image of size 384 × 384 with random horizontal flipping augmentation. No other augmentations are used. We adopt the symmetric reconstruction loss that considers either frame as the target frame and sums the two reconstruction errors. We use the Adam optimizer with a learning rate of 1 × 10⁻⁴ and a weight decay of 1 × 10⁻⁶. We train AMD on 8 V100 GPUs, with each processing two pairs of sampled adjacent frames. The network is optimized for 400K iterations. |
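A few of the reported settings are concrete enough to sketch in code. For the frame sampling in the Open Datasets row, a minimal sketch is shown below; `sample_adjacent_pair` is our own hypothetical helper, since the paper states only that frames are decoded at 24 fps and that training uses pairs of adjacent frames, not the exact sampling procedure.

```python
import random

def sample_adjacent_pair(num_frames: int) -> tuple[int, int]:
    """Return a random pair of adjacent frame indices from one clip.

    Hypothetical helper: the paper reports 24 fps decoding and training
    on adjacent-frame pairs, but not the exact sampling procedure.
    """
    i = random.randrange(num_frames - 1)
    return i, i + 1

# Example: two pairs per GPU, matching the reported batch layout.
pairs = [sample_adjacent_pair(240) for _ in range(2)]
print(pairs)
```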
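For the PASCAL VOC finetuning schedule in the Dataset Splits row, the stated polynomial decay can be sketched as follows. The helper `poly_lr` is ours, and the paper does not say whether the decay is stepped per iteration or per epoch; per iteration is assumed here.

```python
def poly_lr(base_lr: float, step: int, max_steps: int, power: float = 0.9) -> float:
    """Polynomial decay: lr = base_lr * (1 - step / max_steps) ** power."""
    return base_lr * (1.0 - step / max_steps) ** power

# Reported PASCAL VOC finetuning schedule: initial lr 0.01,
# 40,000 iterations, decay power 0.9 (batch size 16).
for step in (0, 10_000, 20_000, 39_999):
    print(f"step {step:>6d}: lr = {poly_lr(0.01, step, 40_000):.6f}")
```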
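The Experiment Setup row reads as a standard PyTorch recipe, although the paper never names its framework (see the Software Dependencies row). The sketch below is therefore an assumed rendering, not the authors' code: the torchvision transform names and the `resnet50` stand-in for the AMD backbone are ours. Note that with 8 V100 GPUs each processing two frame pairs, the effective global batch is 16 pairs.

```python
import torch
from torchvision import transforms
from torchvision.models import resnet50

# Per-frame preprocessing as reported: shorter edge resized to 400 px,
# random 384 x 384 crop, random horizontal flip, no other augmentations.
# Caveat: the two frames of a pair presumably need identical crop/flip
# parameters, which a plain Compose does not guarantee.
frame_transform = transforms.Compose([
    transforms.Resize(400),           # shorter edge -> 400 px
    transforms.RandomCrop(384),       # 384 x 384 square crop
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Optimizer as reported: Adam, lr 1e-4, weight decay 1e-6, 400K iterations.
# resnet50() stands in for the appearance pathway's ResNet50 backbone only;
# the full AMD model (appearance + motion pathways) is not reconstructed here.
backbone = resnet50()
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4, weight_decay=1e-6)

# Symmetric reconstruction loss, per the quoted setup: either frame serves
# as the target and the two reconstruction errors are summed, i.e.
#   loss = err(recon_of_b_from_a, frame_b) + err(recon_of_a_from_b, frame_a)
```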