D^2NeRF: Self-Supervised Decoupling of Dynamic and Static Objects from a Monocular Video
Authors: Tianhao Wu, Fangcheng Zhong, Andrea Tagliasacchi, Forrester Cole, Cengiz Oztireli
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce a new dataset containing various dynamic objects and shadows and demonstrate that our method can achieve better performance than state-of-the-art approaches in decoupling dynamic and static 3D objects, occlusion and shadow removal, and image segmentation for moving objects. Project page: d2nerf.github.io |
| Researcher Affiliation | Collaboration | Tianhao Wu (University of Cambridge); Fangcheng Zhong (University of Cambridge); Andrea Tagliasacchi (Google Research, Simon Fraser University); Forrester Cole (Google Research); Cengiz Oztireli (Google Research, University of Cambridge) |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | Our method is easily reproducible, as we intend to release code and datasets upon publication to facilitate future research. |
| Open Datasets | Yes | We introduce a new dataset with rigid and non-rigid dynamic objects, rapid camera motion and various moving shadows in both the synthetic and real-world settings to evaluate these two aspects, and show that our method achieves better performance than state-of-the-art approaches. Synthetic dataset: We generate a synthetic dataset with ground-truth masks for moving objects and their shadows with Kubric [16]. This dataset consists of five scenes containing one or multiple dynamic objects from ShapeNet [5] with rigid or non-rigid motion, and the corresponding Kubric worker script is provided in our supplementary material. |
| Dataset Splits | Yes | We move the virtual camera over 10 keyframes randomly sampled from azimuth [2, 2 + π/4] and altitude [1, 1.2] to generate a 200-frame video sequence for training. We also rotate the virtual camera around the center of all keyframes to generate 100 validation views with only the static background being visible. |
| Hardware Specification | Yes | This training procedure spans approximately two hours on four NVIDIA A100-SXM-80GB GPUs. |
| Software Dependencies | No | The paper does not provide specific software versions for its dependencies. |
| Experiment Setup | Yes | The optimization takes 100k iterations with batch size 1024 and an exponentially decayed learning rate from 10⁻³ to 10⁻⁵. For scenes with a mixture of dynamic objects and shadows, we apply shadow decay and set λ_ρ = 0.1. We set λ_ρ = 0.001 for scenes featuring view-correlated dynamic shadows only. We experimentally found that the optimal choice of the hyperparameters, especially λ_b, λ_r and the skewness k, are strongly influenced by the level of object motion, camera motion, and video length. Therefore, we performed a grid search on our synthetic and held-out real-world scenes, and some scenes from DAVIS [42], to establish a set of hyperparameters applicable to a variety of scenarios; details about hyperparameters can be found in the supplementary. |
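
The Dataset Splits row above describes sampling 10 camera keyframes from an azimuth/altitude window, expanding them into a 200-frame training sequence, and orbiting the keyframe center for 100 background-only validation views. The sketch below illustrates that sampling scheme; the camera radius, the exact angle ranges (the quoted bounds appear partially garbled by PDF extraction), and the piecewise-linear interpolation between keyframes are our own assumptions for illustration, not the paper's actual Kubric worker settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def spherical_to_xyz(azimuth, altitude, radius=4.0):
    """Convert spherical angles (radians) to a 3D camera position; radius is a placeholder."""
    return np.array([radius * np.cos(altitude) * np.cos(azimuth),
                     radius * np.cos(altitude) * np.sin(azimuth),
                     radius * np.sin(altitude)])

# 10 keyframes sampled from an azimuth/altitude window (ranges taken from the
# quoted excerpt, which looks partially garbled, so treat them as placeholders).
azimuths  = np.sort(rng.uniform(2.0, 2.0 + np.pi / 4, size=10))
altitudes = rng.uniform(1.0, 1.2, size=10)
keyframes = np.stack([spherical_to_xyz(a, b) for a, b in zip(azimuths, altitudes)])

# 200 training camera positions by piecewise-linear interpolation between keyframes.
t = np.linspace(0, len(keyframes) - 1, 200)
idx = np.clip(t.astype(int), 0, len(keyframes) - 2)
frac = (t - idx)[:, None]
train_cams = (1 - frac) * keyframes[idx] + frac * keyframes[idx + 1]

# 100 validation views: rotate the first keyframe around the centroid of all keyframes.
center = keyframes.mean(axis=0)
offset = keyframes[0] - center
val_cams = np.stack([
    center + np.array([np.cos(a) * offset[0] - np.sin(a) * offset[1],
                       np.sin(a) * offset[0] + np.cos(a) * offset[1],
                       offset[2]])
    for a in np.linspace(0, 2 * np.pi, 100, endpoint=False)
])
```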
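
The Experiment Setup row above quotes an exponentially decayed learning rate from 10⁻³ to 10⁻⁵ over 100k iterations. Below is a minimal sketch of such a schedule, assuming plain log-linear interpolation between the initial and final rates; the function name is our own, and the authors' actual training code (including any warm-up or delayed decay) may differ.

```python
import numpy as np

def exp_decay_lr(step, total_steps=100_000, lr_init=1e-3, lr_final=1e-5):
    """Exponential (log-linear) learning-rate decay, a common NeRF-style schedule.

    Illustrative sketch only; not the authors' released code.
    """
    t = np.clip(step / total_steps, 0.0, 1.0)
    return float(np.exp(np.log(lr_init) * (1 - t) + np.log(lr_final) * t))

# Usage: the rate starts at 1e-3 and reaches 1e-5 at the final iteration.
print(exp_decay_lr(0))        # ~1e-3
print(exp_decay_lr(50_000))   # ~1e-4 (geometric midpoint)
print(exp_decay_lr(100_000))  # ~1e-5
```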