IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation
Authors: Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, Filippos Kokkinos
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Compared to SDS, it dramatically reduces the number of evaluations of the 2D generator network. Using a fast sampler, generating the first version of the multi-view images requires only around 40 evaluations. Iterated generations are much shorter (as they start from a partially denoised result), at most doubling the total number of evaluations. This is a 10-100× reduction compared to SDS. The 3D reconstruction is also very fast, taking only a minute for the first version of the asset and a few seconds for the second or third. It also sidesteps typical SDS issues such as artifacts (e.g., saturated colors, the Janus problem), lack of diversity (by avoiding mode seeking), and low yield (failure to converge). Compared to methods like (Li et al., 2023), IM-3D is slower but achieves much higher quality and does not require learning large reconstruction networks, offloading most of the work to 2D generation instead. ... 4. Experiments ... Quantitative comparison. Table 1 provides a quantitative comparison of our method to others. (A minimal sketch of this iterative generation-reconstruction loop appears after the table.) |
| Researcher Affiliation | Collaboration | ¹Meta, ²University of Oxford, Oxford, UK. |
| Pseudocode | No | The paper describes its methods in prose and with figures, but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions a link to Open LRM (He & Wang, 2023) as a baseline model, but there is no explicit statement or link indicating that the authors have released the source code for IM-3D, the methodology described in this paper. |
| Open Datasets | Yes | The dataset J used to train our model consists of turntable-like videos of synthetic 3D objects. Several related papers in multi-view generation also use synthetic data, taking Objaverse (Deitke et al., 2022) or Objaverse-XL (Deitke et al., 2023) as a source. Here, we utilize an in-house collection of 3D assets of comparable quality, for which we generate textual descriptions using an image captioning network. |
| Dataset Splits | No | The paper mentions using a training dataset but does not specify explicit training/validation/test dataset splits, percentages, or sample counts for reproducibility. |
| Hardware Specification | Yes | We minimize the standard diffusion loss over a span of 5 days, employing 80 A100 GPUs with a total batch size of 240 and a learning rate of 1e-5. |
| Software Dependencies | No | The paper mentions various models and frameworks used (e.g., Emu Video, DPM++, Gaussian splatting, CLIP, SDXL) but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, CUDA, or specific library versions). |
| Experiment Setup | Yes | In line with (Girdhar et al., 2023), we maintain the spatial convolutional and attention layers of Emu Video, fine-tuning only the temporal layers. We minimize the standard diffusion loss over a span of 5 days, employing 80 A100 GPUs with a total batch size of 240 and a learning rate of 1e-5. ... For Gaussian fitting, we initialize 5000 points at the center of the 3D space, and densify and prune the Gaussians every 50 iterations. We conduct optimization for 1200 iterations and execute Emu Video twice for 10 iterations each using the DPM solver (Lu et al., 2022) during this process, repeating this every 500 iterations. Empirically, we found that setting the weights to w_LPIPS = 10, w_SSIM = 0.2, and w_Mask = 1 yields the best results during the fitting stage. (A hedged sketch of this fitting schedule appears after the table.) |
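
To make the Research Type excerpt concrete, the following is a minimal sketch of the iterative multiview-diffusion-and-reconstruction loop it describes: roughly 40 denoising steps for the first multi-view generation, a fast Gaussian-splatting fit, then short refinement passes that re-noise the current renders to an intermediate timestep instead of starting from pure noise. All function names and bodies here are hypothetical placeholders, not the authors' released code; only the step counts and timings come from the excerpts quoted above.

```python
# Hypothetical sketch of an IM-3D-style generate / reconstruct / refine loop.
# The heavy components (Emu Video, Gaussian splatting) are replaced by numpy stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def generate_multiview(prompt, num_views=16, steps=40):
    """Stand-in for the multi-view video diffusion model; the paper reports
    roughly 40 evaluations with a fast sampler for the first generation."""
    return rng.random((num_views, 64, 64, 3))  # fake RGB views

def fit_gaussians(views, iters=1200):
    """Stand-in for fitting a 3D Gaussian-splatting asset to the views
    (about a minute for the first fit, seconds for later refits)."""
    return {"views": views, "iters": iters}

def render_views(asset):
    """Stand-in for re-rendering the fitted asset from the same cameras."""
    return asset["views"]

def partial_renoise_and_denoise(renders, steps=10, noise_level=0.5):
    """Stand-in for re-noising the renders to an intermediate timestep and
    running a short (~10-step) denoising pass, so iterated generations start
    from a partially denoised result."""
    noised = (1 - noise_level) * renders + noise_level * rng.standard_normal(renders.shape)
    return np.clip(noised, 0.0, 1.0)  # pretend this was denoised again

def im3d_like_loop(prompt, refinement_rounds=2):
    views = generate_multiview(prompt)            # ~40 network evaluations
    asset = fit_gaussians(views)                  # first, slower fit
    for _ in range(refinement_rounds):            # each round adds ~10 evaluations
        renders = render_views(asset)
        views = partial_renoise_and_denoise(renders)
        asset = fit_gaussians(views, iters=200)   # refits take only seconds
    return asset

asset = im3d_like_loop("a wooden toy robot")
print("final asset fitted from", asset["views"].shape[0], "views")
```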
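
The Experiment Setup row likewise pins down a concrete fitting schedule. The sketch below shows one plausible way those constants could be wired together; the loss and rendering functions are again placeholders, and only the constants (5000 initial Gaussians, 1200 iterations, densify/prune every 50 iterations, a 10-step Emu Video refresh every 500 iterations, and weights 10 / 0.2 / 1) are taken from the paper's description.

```python
# Hypothetical sketch of the Gaussian-fitting schedule quoted in the Experiment Setup row.
import numpy as np

rng = np.random.default_rng(0)

# Loss weights reported in the paper.
W_LPIPS, W_SSIM, W_MASK = 10.0, 0.2, 1.0

def lpips_loss(render, target):   # placeholder for a perceptual (LPIPS) loss
    return float(np.mean((render - target) ** 2))

def ssim_loss(render, target):    # placeholder for a structural-similarity loss
    return float(np.mean(np.abs(render - target)))

def mask_loss(render, target):    # placeholder for a silhouette/mask loss
    return float(np.mean((render[..., :1] - target[..., :1]) ** 2))

def fit_gaussians(target_views, refresh_views, total_iters=1200):
    gaussians = rng.standard_normal((5000, 14))  # 5000 points initialized at the center
    for it in range(1, total_iters + 1):
        # Fake render: in practice this would rasterize the current Gaussians.
        render = target_views + 0.01 * rng.standard_normal(target_views.shape)
        loss = (W_LPIPS * lpips_loss(render, target_views)
                + W_SSIM * ssim_loss(render, target_views)
                + W_MASK * mask_loss(render, target_views))
        # ... gradient step on `gaussians` would go here ...
        if it % 50 == 0:
            pass  # densify and prune the Gaussians every 50 iterations
        if it % 500 == 0:
            # rerun the video model for ~10 DPM-solver steps on the current renders
            target_views = refresh_views(render, steps=10)
    return gaussians

views = rng.random((16, 64, 64, 3))
refresh = lambda renders, steps: np.clip(renders, 0.0, 1.0)  # stand-in for Emu Video
fitted = fit_gaussians(views, refresh)
print("fitted", fitted.shape[0], "Gaussians")
```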