Unsupervised Multi-Object Segmentation by Predicting Probable Motion Patterns
Authors: Laurynas Karazija, Subhabrata Choudhury, Iro Laina, Christian Rupprecht, Andrea Vedaldi
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the advantage of this approach over its deterministic counterpart and show state-of-the-art unsupervised object segmentation performance on simulated and real-world benchmarks, surpassing methods that use motion even at test time. As our approach is applicable to a variety of network architectures that segment the scene, we also apply it to existing image reconstruction-based models showing drastic improvement. |
| Researcher Affiliation | Academia | Visual Geometry Group University of Oxford Oxford, UK {laurynas,subha,iro,chrisr,vedaldi}@robots.ox.ac.uk |
| Pseudocode | No | The paper does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project page and code: https://www.robots.ox.ac.uk/~vgg/research/ppmp. |
| Open Datasets | Yes | Datasets. We evaluate our method on video and still image datasets. For video-based data, we use the Multi-Object Video (MOVi) datasets, released as part of Kubric [20]. Specifically, we employ MOVi-{A,C,D,E} versions. ... To evaluate our method on still images, we use CLEVR [25] and CLEVRTEX [27] benchmark suites. ... Since our method requires optical flow during training, we extend the implementation of [27] to generate video datasets of CLEVR and CLEVRTEX scenes ... We also evaluate our method on the real-world KITTI [16] benchmark which depicts street scenes captured from a moving car. |
| Dataset Splits | Yes | We generate 10k sequences for MOVINGCLEVRTEX and 5k for MOVINGCLEVR, where we retain 1000 and 500 sequences, respectively, for validation. Each sequence is 5 frames long. ... We follow the setup of [3], using 147 videos for training, and evaluate on the instance segmentation subset which contains 200 annotated validation frames. (These split sizes are illustrated in the sketch after the table.) |
| Hardware Specification | Yes | The model takes approximately 48h to train on a single A30 24GB GPU. All training details and hyper-parameters are included in the Appendix. (Footnote: Approx. total compute in this paper: 100 GPU days for our models, 154 GPU days for comparisons.) |
| Software Dependencies | No | The paper mentions using Mask2Former [7], a ResNet-18 CNN backbone, a Swin-tiny transformer [36], and RAFT [48], but does not provide specific version numbers for these components or for any other libraries/environments. |
| Experiment Setup | No | The paper states 'All training details and hyper-parameters are included in the Appendix,' indicating that these details are deferred to the Appendix rather than given in the main text. Only general settings, such as network architectures, are mentioned, without specific values. |
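
For concreteness, the split sizes quoted in the Dataset Splits row can be written as a minimal sketch. This is an illustrative reconstruction, not the authors' released code: the helper name `split_sequences` and the seeded random shuffle are assumptions, since the paper does not state how the held-out sequences are chosen.

```python
# Minimal sketch of the train/validation splits reported in the paper:
# MOVingCLEVRTex: 10k sequences with 1,000 retained for validation;
# MOVingCLEVR:     5k sequences with   500 retained for validation;
# each sequence is 5 frames long. Names and shuffling are hypothetical.
import random

def split_sequences(num_sequences: int, num_val: int, seed: int = 0):
    """Return (train_ids, val_ids) over sequence indices."""
    ids = list(range(num_sequences))
    random.Random(seed).shuffle(ids)  # assumed: the paper does not specify the selection scheme
    return ids[num_val:], ids[:num_val]

train_ctex, val_ctex = split_sequences(10_000, 1_000)  # MOVingCLEVRTex
train_clevr, val_clevr = split_sequences(5_000, 500)   # MOVingCLEVR

FRAMES_PER_SEQ = 5  # each generated sequence is 5 frames long
print(len(train_ctex), len(val_ctex))    # 9000 1000
print(len(train_clevr), len(val_clevr))  # 4500 500
```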