DreamSparse: Escaping from Plato’s Cave with 2D Diffusion Model Given Sparse Views

Authors: Paul Yoo, Jiaxian Guo, Yutaka Matsuo, Shixiang (Shane) Gu

NeurIPS 2023

Each entry below pairs a reproducibility variable with its assessed result, followed by the LLM response quoted as supporting evidence.
Research Type: Experimental
LLM Response: "Experimental results demonstrate that our framework can effectively synthesize novel view images from sparse views and outperforms baselines in both trained and open-set category images."
Researcher Affiliation: Academia
LLM Response: "Paul Yoo, Jiaxian Guo, Yutaka Matsuo, Shixiang Shane Gu, The University of Tokyo, {paulyoo, jiaxian.guo}@weblab.t.u-tokyo.ac.jp"
Pseudocode: No
LLM Response: The paper describes the architecture and processes in detail in text and figures (e.g., Figure 2 for the overall pipeline) but does not include pseudocode or clearly labeled algorithm blocks.
Open Source Code: No
LLM Response: "More results can be found on our project page: https://sites.google.com/view/dreamsparse-webpage" (Checked link; the page states "Code (coming soon)".)
Open Datasets: Yes
LLM Response: "Following SparseFusion [76], we perform experiments on real-world scenes from the Common Objects in 3D (CO3Dv2) [37]... We train and evaluate our framework on the CO3Dv2 [37] dataset's fewview_train and fewview_dev sequence sets respectively. ... We additionally train and evaluate our method and baselines on the cars category of the ShapeNet [5] synthetic dataset of object renderings."
Dataset Splits: Yes
LLM Response: "We train and evaluate our framework on the CO3Dv2 [37] dataset's fewview_train and fewview_dev sequence sets respectively. ... For computing evaluation metrics, we select 10 objects per category and sample 32 uniformly spaced camera poses from the held-out test split. We then randomly select a specified number of context views from the camera poses and evaluate novel view synthesis results on the rest of the poses."
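
The sampling protocol quoted above is concrete enough to sketch. The following is a minimal illustration, not the authors' released code; the function and variable names are hypothetical, and num_context stands in for the "specified number of context views" in the quote:

    import numpy as np

    def make_eval_split(num_poses=32, num_context=2, seed=0):
        # Hypothetical helper mirroring the quoted protocol: 32 uniformly
        # spaced camera poses from the held-out split, a random subset
        # taken as context views, and the remaining poses evaluated.
        rng = np.random.default_rng(seed)
        pose_ids = np.arange(num_poses)
        context_ids = rng.choice(pose_ids, size=num_context, replace=False)
        target_ids = np.setdiff1d(pose_ids, context_ids)  # evaluate on these
        return context_ids, target_ids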
Hardware Specification: Yes
LLM Response: "We jointly train the geometry and the spatial modules on 8 A100-40GB GPUs for 3 days with a batch size of 15."
Software Dependencies: Yes
LLM Response: "We use Stable Diffusion v1.5 [42] as the frozen pre-trained diffusion model and DDIM [56] to synthesize novel views with 20 denoising steps. ... We use a ResNet50 [12] backbone ... We employ a Transformer [64]."
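
For reference, the quoted dependency stack (frozen Stable Diffusion v1.5 sampled with DDIM at 20 steps) can be set up with the Hugging Face diffusers library. The paper does not name the library, so this is an assumption about tooling, and the text-prompt call below merely stands in for the paper's spatial-feature conditioning:

    import torch
    from diffusers import StableDiffusionPipeline, DDIMScheduler

    # Load Stable Diffusion v1.5 and swap in a DDIM scheduler.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
    pipe.unet.requires_grad_(False)  # the diffusion model stays frozen

    # 20 denoising steps, matching the quoted setting.
    image = pipe("a photo of a hydrant", num_inference_steps=20).images[0]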
Experiment Setup: Yes
LLM Response: "The resolutions of the feature map for the spatial guidance module and latent noise are set as 64×64 with spatial guidance weight λ = 2. The three transformers used in the geometry module all contain 4 layers... with a batch size of 15. To demonstrate our framework's generalization capability at object-level novel view synthesis, we trained our framework on a subset of 10 categories as specified in [37]."
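
The λ = 2 spatial guidance weight suggests a classifier-guidance-style update during denoising. Below is a hedged sketch of how such a weight could enter a DDIM step; feat_fn and guidance_feat are hypothetical stand-ins for a feature extractor and the 64×64 spatial guidance map, and this is not the authors' implementation:

    import torch
    import torch.nn.functional as F

    def spatially_guided_eps(latent, eps_pred, guidance_feat, feat_fn, lam=2.0):
        # Classifier-guidance-style sketch: nudge the predicted noise with
        # the gradient of an L2 loss between features of the current latent
        # and the spatial guidance features, scaled by lambda (= 2 above).
        latent = latent.detach().requires_grad_(True)
        loss = F.mse_loss(feat_fn(latent), guidance_feat)
        grad = torch.autograd.grad(loss, latent)[0]
        return eps_pred + lam * grad  # feed into the usual DDIM update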