Coherent 3D Scene Diffusion From a Single RGB Image

Authors: Manuel Dahnert, Angela Dai, Norman Müller, Matthias Nießner

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the following sections, we will demonstrate the advantages of our method and contributions by evaluating it against common 3D scene reconstruction benchmarks. (...) In Fig. 3, we present qualitative comparisons of our approach against state-of-the-art methods for single-view 3D scene reconstruction. (...) In Tab. 1, we quantitatively compare the single-view shape reconstruction performance of our approach against baseline methods on the Pix3D dataset. (...) We conduct a series of detailed ablation studies to verify the effectiveness of our design decisions and contributions. The quantitative results are provided in Tab. 2.
Researcher Affiliation | Collaboration | Manuel Dahnert (1), Angela Dai (1), Norman Müller (2), Matthias Nießner (1); (1) Technical University of Munich, Germany; (2) Meta Reality Labs Zurich, Switzerland
Pseudocode | No | The paper describes its methods in text and figures but does not include any formal pseudocode or algorithm blocks.
Open Source Code | No | Justification: We will release the code after cleaning up and documenting it for easy usage.
Open Datasets | Yes | Following [23, 48, 77], we train and evaluate the performance of our 3D pose estimation on the SUN RGB-D [62] dataset with the official splits. (...) We train and evaluate the performance of our 3D shape reconstruction on the Pix3D [64] dataset, which contains images of common furniture objects with pixel-aligned 3D shapes from 9 object classes, comprising 10,046 images.
Dataset Splits | No | Following [23, 48, 77], we train and evaluate the performance of our 3D pose estimation on the SUN RGB-D [62] dataset with the official splits. (...) We use the train and test splits defined in [37], ensuring that 3D models between the respective splits do not overlap.
Hardware Specification | Yes | We train our models on a single RTX 3090 with 24 GB VRAM for 1000 epochs on Pix3D, for 500 epochs on SUN RGB-D, and for 50 epochs of additional joint training using L_align.
Software Dependencies | No | We implement our model in PyTorch [50] and use the AdamW [41] optimizer with a learning rate of 1 × 10⁻⁴ and β1 = 0.9, β2 = 0.999. We utilize an off-the-shelf 2D instance segmentation model, Mask2Former [5], which is pre-trained on COCO [36] using a Swin Transformer [39] backbone. (An illustrative Mask2Former sketch follows below the table.)
Experiment Setup | Yes | For all diffusion training processes, we uniformly sample time steps t = 1, ..., T with T = 1000, and use a linear variance schedule with β1 = 0.0001 and βT = 0.02. We implement our model in PyTorch [50] and use the AdamW [41] optimizer with a learning rate of 1 × 10⁻⁴ and β1 = 0.9, β2 = 0.999. We train our models on a single RTX 3090 with 24 GB VRAM for 1000 epochs on Pix3D, for 500 epochs on SUN RGB-D, and for 50 epochs of additional joint training using L_align. During inference, we employ DDIM [61] with 100 steps to accelerate sampling speed. For classifier-free guidance [21], we drop the condition y with probability p = 0.8. (A training-setup sketch follows below the table.)
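
The experiment-setup row reports a linear variance schedule (β1 = 0.0001, βT = 0.02, T = 1000), uniform timestep sampling, AdamW with a 1 × 10⁻⁴ learning rate, and condition dropping for classifier-free guidance. The following is a minimal sketch that wires these reported hyperparameters into a generic PyTorch epsilon-prediction training step; the names `denoiser`, `x0`, and `cond` are placeholders, and zeroing out the dropped condition is one common convention, not necessarily the authors' implementation.

```python
# Minimal sketch of the reported diffusion training setup; placeholder names
# (`denoiser`, `x0`, `cond`) and the epsilon-prediction loss are assumptions.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear schedule, beta_1 .. beta_T
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise):
    """Forward process: diffuse a clean sample x0 to timestep t."""
    a_bar = alphas_cumprod.to(x0.device)[t].view(-1, *([1] * (x0.dim() - 1)))
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

def training_step(denoiser, x0, cond, p_drop=0.8):
    """One denoising step; the condition is dropped with probability p_drop
    so that classifier-free guidance can be used at sampling time."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)          # uniform timestep sampling
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    keep = (torch.rand(b, device=x0.device) >= p_drop).float()
    cond = cond * keep.view(-1, *([1] * (cond.dim() - 1)))   # zero out dropped conditions
    pred_noise = denoiser(x_t, t, cond)
    return F.mse_loss(pred_noise, noise)

# Optimizer as reported: AdamW, lr = 1e-4, betas = (0.9, 0.999).
# optimizer = torch.optim.AdamW(denoiser.parameters(), lr=1e-4, betas=(0.9, 0.999))
```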
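
The software-dependencies row mentions an off-the-shelf Mask2Former instance segmentation model pre-trained on COCO with a Swin Transformer backbone. The snippet below is an illustrative way to obtain such instance masks via the Hugging Face `transformers` port of Mask2Former; the checkpoint name and the use of `transformers` (rather than the authors' original segmentation pipeline) are assumptions for illustration only.

```python
# Illustrative only: off-the-shelf Mask2Former instance masks via the
# Hugging Face `transformers` port. The checkpoint name is an assumption;
# the paper does not state which Mask2Former weights were used.
import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

ckpt = "facebook/mask2former-swin-large-coco-instance"   # COCO-pretrained, Swin backbone
processor = AutoImageProcessor.from_pretrained(ckpt)
model = Mask2FormerForUniversalSegmentation.from_pretrained(ckpt).eval()

image = Image.open("scene.jpg").convert("RGB")           # hypothetical input image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-pixel instance map plus class label and confidence for each detected instance.
result = processor.post_process_instance_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
instance_map = result["segmentation"]   # (H, W) tensor of instance ids
instances = result["segments_info"]     # list of {"id", "label_id", "score"}
```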