Segmenting Moving Objects via an Object-Centric Layered Representation
Authors: Junyu Xie, Weidi Xie, Andrew Zisserman
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct thorough ablation studies, showing that the model is able to learn object permanence and temporal shape consistency, and is able to predict amodal segmentation masks; Fourth, we evaluate our model, trained only on synthetic data, on standard video segmentation benchmarks, DAVIS, MoCA, SegTrack, FBMS-59, and achieve state-of-the-art performance among existing methods that do not rely on any manual annotations. |
| Researcher Affiliation | Academia | Junyu Xie¹, Weidi Xie¹,², Andrew Zisserman¹; ¹Visual Geometry Group, Department of Engineering Science, University of Oxford, UK; ²Coop. Medianet Innovation Center, Shanghai Jiao Tong University, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | In the Supplementary Material, we have provided the details in the curated DAVIS2017-motion dataset, together with some demonstration videos of our synthetic dataset. The full datasets and codes for training and evaluation will be released before the publication. |
| Open Datasets | Yes | To evaluate our multi-layer model, we benchmark on several popular datasets for video object segmentation tasks. A brief overview of the datasets is given below, with full details in the Supplementary Material. For single object video segmentation, we evaluate the model on DAVIS2016 [49], SegTrackv2 [34], FBMS-59 [46] and MoCA [33]. (...) The real object masks are directly sourced from the silhouettes in the YouTube-VOS dataset [66]. These generated objects are then applied with textures sampled from the PASS dataset [3]. |
| Dataset Splits | Yes | We prepare 4k synthetic sequences (around 120k frames) for training; the video sequences contain 1, 2, or 3 objects in equal proportions. In the following, we train all models on this synthetic video dataset unless otherwise specified. To benchmark motion-based segmentation for multiple objects, we introduce a synthetic validation dataset (Syn-val) and a curated dataset (DAVIS2017-motion). The former is generated with the same parameters as our synthetic training set (Sect. 3), containing over 300 multi-object sequences (around 10k frames) with 1, 2, 3 objects at equal proportions, and controllable occlusions for evaluating modal and amodal segmentations in the ablation studies. (A short sanity check of these figures is sketched below the table.) |
| Hardware Specification | No | The main paper states: 'In the Supplementary Material, we have reported the amount of computation to train, test and test-time adapt our model, together with details of the adopted GPU and approximate time taken.' However, the details themselves are not provided within the main body of the paper. |
| Software Dependencies | No | The paper mentions software like 'RAFT' for optical flow estimation, 'Adam optimizer', and 'DINO-pretrained vision transformer', but does not specify their version numbers. |
| Experiment Setup | Yes | During training, we split the video sequences into 30 frames per sample, each input frame is first encoded by a U-Net encoder into a feature map with 1/16 of its original spatial resolution, and passed to the transformer bottleneck. We use K = 3 learnable object queries, associating to 3 independent foreground layers. The model is trained by the Adam optimizer [28] with a learning rate linearly warmed up to 5 × 10⁻⁵ during 40k iterations, and decreased by half every 80k iterations. (A sketch of this schedule is given below the table.) |
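
The figures quoted in the "Dataset Splits" row can be cross-checked with a short script. This is a minimal sketch, not the authors' data pipeline; the 30-frame clip length is taken from the "Experiment Setup" row, and the equal split over object counts is as stated in the paper.

```python
# Sanity check of the synthetic-data sizes quoted above.
# Assumption: every training sample is a 30-frame clip, as in the Experiment Setup row.
TRAIN_SEQUENCES = 4_000      # synthetic training sequences
VAL_SEQUENCES = 300          # Syn-val multi-object sequences (reported as "over 300")
FRAMES_PER_SAMPLE = 30       # frames per training sample
OBJECT_COUNTS = (1, 2, 3)    # objects per sequence, in equal proportions

train_frames = TRAIN_SEQUENCES * FRAMES_PER_SAMPLE          # 120,000 -> "around 120k frames"
sequences_per_count = TRAIN_SEQUENCES // len(OBJECT_COUNTS)  # ~1,333 sequences per object count
val_frames_per_seq = 10_000 / VAL_SEQUENCES                  # ~33 frames -> "around 10k frames"

print(train_frames, sequences_per_count, round(val_frames_per_seq))
```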
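
The optimizer settings in the "Experiment Setup" row can likewise be expressed as a short schedule sketch. Only the numbers (K = 3 queries, 30-frame clips, peak learning rate of 5 × 10⁻⁵, 40k-iteration warm-up, halving every 80k iterations, Adam) come from the paper; the `lr_at` function and the assumption that the halving counter starts after warm-up are illustrative guesses, not the authors' released code.

```python
# Hedged sketch of the training schedule quoted above. Constants come from the paper;
# the exact anchor point of the halving schedule is an assumption.
NUM_QUERIES = 3          # K = 3 learnable object queries -> 3 foreground layers
CLIP_LEN = 30            # frames per training sample
PEAK_LR = 5e-5           # learning rate reached after linear warm-up
WARMUP_ITERS = 40_000    # linear warm-up duration
DECAY_EVERY = 80_000     # learning rate halved every 80k iterations thereafter

def lr_at(step: int) -> float:
    """Linear warm-up to PEAK_LR, then halve every DECAY_EVERY iterations."""
    if step < WARMUP_ITERS:
        return PEAK_LR * step / WARMUP_ITERS
    return PEAK_LR * 0.5 ** ((step - WARMUP_ITERS) // DECAY_EVERY)

# In a framework such as PyTorch, this function could drive an Adam optimizer, e.g.:
#   optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR)
#   scheduler = torch.optim.lr_scheduler.LambdaLR(
#       optimizer, lr_lambda=lambda step: lr_at(step) / PEAK_LR)

for step in (0, 20_000, 40_000, 120_000, 200_000):
    print(step, lr_at(step))
```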