BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation

Authors: Xiang Zhang, Bingxin Ke, Hayko Riemenschneider, Nando Metzger, Anton Obukhov, Markus Gross, Konrad Schindler, Christopher Schroers

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "With efficient training on small-scale synthetic datasets, BetterDepth achieves state-of-the-art zero-shot MDE performance on diverse public datasets and on in-the-wild scenes." (Section 4, Experiments and Analysis)
Researcher Affiliation | Collaboration | ETH Zürich, DisneyResearch|Studios
Pseudocode | Yes | "Algorithm 1: BetterDepth Training Procedure"
Open Source Code | No | "This is research done in collaboration with a corporate research lab and we haven't been able to get clearance to release the code."
Open Datasets | Yes | "We follow Marigold [17] and use 74K samples from two synthetic datasets, Hypersim [33] and Virtual KITTI [2], for training." The NeurIPS checklist also lists Hypersim: https://github.com/apple/ml-hypersim and Virtual KITTI: https://europe.naverlabs.com/research-old2/computer-vision/proxy-virtual-worlds-vkitti-2/
Dataset Splits | No | "We follow Marigold [17] and use 74K samples from two synthetic datasets, Hypersim [33] and Virtual KITTI [2], for training. For evaluation, we employ five unseen datasets: NYUv2 [28] (654 samples), KITTI [11] (652 samples from the Eigen test split [9]), ETH3D [43] (454 samples), ScanNet [6] (800 samples based on the Marigold split [17]), and DIODE [45] (325 indoor samples and 446 outdoor ones)." The paper defines training datasets and evaluation datasets, but not a distinct validation split.
Hardware Specification | Yes | "The training takes around 1.5 days on a single NVIDIA RTX A6000 GPU." ... "on an NVIDIA GeForce RTX 4090 GPU."
Software Dependencies | No | The paper mentions using the "Adam optimizer [18]" but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | "BetterDepth is trained for 5K iterations with batch size 32. The Adam optimizer [18] is used with the learning rate set to 3×10⁻⁵. We set the patch size w = 8 and the masking threshold η = 0.1 under the depth range [-1, 1]. For inference, we apply the DDIM scheduler with 50-step sampling [44] and obtain the final result with 10 test-time ensemble members [17]."
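
As a compact reference, the reported training and inference settings can be collected into a configuration sketch. This is a minimal illustration assuming a diffusers-style DDIM sampler; the class and field names below are hypothetical and do not come from the (unreleased) BetterDepth code.

```python
# Minimal sketch of the reported BetterDepth settings (hypothetical names;
# the official code is not released). DDIMScheduler is the real class from
# the Hugging Face diffusers library; everything else is illustrative.
from dataclasses import dataclass

from diffusers import DDIMScheduler


@dataclass
class TrainConfig:
    iterations: int = 5_000          # "trained for 5K iterations"
    batch_size: int = 32
    learning_rate: float = 3e-5      # Adam optimizer [18]
    patch_size: int = 8              # w = 8
    masking_threshold: float = 0.1   # eta = 0.1, depth normalized to [-1, 1]


@dataclass
class InferenceConfig:
    num_inference_steps: int = 50    # DDIM scheduler, 50-step sampling [44]
    ensemble_size: int = 10          # 10 test-time ensemble members [17]


# Example: setting up the 50-step DDIM sampler as reported in the paper.
scheduler = DDIMScheduler()
scheduler.set_timesteps(InferenceConfig().num_inference_steps)
```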