Segmenting Moving Objects via an Object-Centric Layered Representation

Authors: Junyu Xie, Weidi Xie, Andrew Zisserman

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct thorough ablation studies, showing that the model is able to learn object permanence and temporal shape consistency, and is able to predict amodal segmentation masks; Fourth, we evaluate our model, trained only on synthetic data, on standard video segmentation benchmarks, DAVIS, MoCA, SegTrack, FBMS-59, and achieve state-of-the-art performance among existing methods that do not rely on any manual annotations.
Researcher Affiliation | Academia | Junyu Xie¹, Weidi Xie¹,², Andrew Zisserman¹ (¹Visual Geometry Group, Department of Engineering Science, University of Oxford, UK; ²Coop. Medianet Innovation Center, Shanghai Jiao Tong University, China)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | In the Supplementary Material, we have provided the details in the curated DAVIS2017-motion dataset, together with some demonstration videos of our synthetic dataset. The full datasets and codes for training and evaluation will be released before the publication.
Open Datasets | Yes | To evaluate our multi-layer model, we benchmark on several popular datasets for video object segmentation tasks. A brief overview of the datasets is given below, with full details in the Supplementary Material. For single object video segmentation, we evaluate the model on DAVIS2016 [49], SegTrackv2 [34], FBMS-59 [46] and MoCA [33]. (...) The real object masks are directly sourced from the silhouettes in the YouTube-VOS dataset [66]. These generated objects are then applied with textures sampled from the PASS dataset [3]. (A hedged sketch of this silhouette-and-texture compositing appears after the table.)
Dataset Splits | Yes | We prepare 4k synthetic sequences (around 120k frames) for training, the video sequences contain 1, 2, or 3 objects in equal proportions. In the following, we train all models on this synthetic video dataset unless otherwise specified. To benchmark motion-based segmentation for multiple objects, we introduce a synthetic validation dataset (Syn-val) and a curated dataset (DAVIS2017-motion). The former is generated with the same parameters as our synthetic training set (Sect. 3), containing over 300 multi-object sequences (around 10k frames) with 1, 2, 3 objects at equal proportions, and controllable occlusions for evaluating modal and amodal segmentations in the ablation studies.
Hardware Specification | No | The main paper states: 'In the Supplementary Material, we have reported the amount of computation to train, test and test-time adapt our model, together with details of the adopted GPU and approximate time taken.' However, the details themselves are not provided within the main body of the paper.
Software Dependencies | No | The paper mentions software such as 'RAFT' for optical flow estimation, the 'Adam optimizer', and a 'DINO-pretrained vision transformer', but does not specify their version numbers.
Experiment Setup | Yes | During training, we split the video sequences into 30 frames per sample, each input frame is first encoded by a U-Net encoder into a feature map with 1/16 of its original spatial resolution, and passed to the transformer bottleneck. We use K = 3 learnable object queries, associating to 3 independent foreground layers. The model is trained by the Adam optimizer [28] with a learning rate linearly warmed up to 5×10⁻⁵ during 40k iterations, and decreased by half every 80k iterations. (A minimal sketch of this learning-rate schedule follows below.)
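
To make the synthetic-data description in the Open Datasets and Dataset Splits rows more concrete, below is a hedged Python sketch of how object silhouettes (e.g. from YouTube-VOS) textured with PASS crops might be composited onto a background frame, with 1, 2, or 3 objects sampled in equal proportions. The function name, argument layout, and compositing order are illustrative assumptions, not the authors' released pipeline.

```python
# Hedged sketch of the synthetic-sequence compositing described above.
# All names and the compositing order here are illustrative assumptions.
import random
import numpy as np

def composite_frame(background: np.ndarray,
                    silhouettes: list[np.ndarray],
                    textures: list[np.ndarray]) -> tuple[np.ndarray, list[np.ndarray]]:
    """Paste 1, 2, or 3 textured object layers (equal probability) onto a background.

    background : H x W x 3 uint8 image
    silhouettes: list of H x W boolean masks (object shapes, e.g. YouTube-VOS silhouettes)
    textures   : list of H x W x 3 uint8 texture images (e.g. crops from PASS)
    Returns the composited frame and the per-object visible (modal) masks.
    """
    frame = background.copy()
    num_objects = random.choice([1, 2, 3])          # equal proportions, as in the paper
    masks = []
    for mask, texture in zip(random.sample(silhouettes, num_objects),
                             random.sample(textures, num_objects)):
        frame[mask] = texture[mask]                 # later layers occlude earlier ones
        masks = [m & ~mask for m in masks]          # earlier objects lose occluded pixels
        masks.append(mask.copy())
    return frame, masks
```

Keeping each silhouette before the occlusion update would instead give amodal masks, which is the kind of supervision relevant to the amodal-segmentation evaluation mentioned in the Research Type row.

The learning-rate schedule quoted in the Experiment Setup row (linear warm-up to 5×10⁻⁵ over 40k iterations, then halved every 80k iterations) can be expressed as a short PyTorch scheduler. The sketch below is one minimal reading of that description, assuming the halving period is counted from the end of warm-up; it is not the authors' training code.

```python
# Minimal sketch (assumed, not the authors' code) of the described schedule:
# linear warm-up to 5e-5 over the first 40k iterations, then halving every 80k iterations.
import torch

PEAK_LR = 5e-5
WARMUP_ITERS = 40_000
HALVING_PERIOD = 80_000

def lr_scale(iteration: int) -> float:
    """Multiplier applied to PEAK_LR at a given training iteration."""
    if iteration < WARMUP_ITERS:
        return iteration / WARMUP_ITERS                               # linear warm-up
    return 0.5 ** ((iteration - WARMUP_ITERS) // HALVING_PERIOD)      # step decay

model = torch.nn.Linear(8, 8)                        # stand-in for the actual model
optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)

for it in range(200_000):
    # ... forward pass, loss computation, loss.backward(), optimizer.step() ...
    optimizer.zero_grad()
    scheduler.step()                                 # advance the schedule once per iteration
```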
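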