CAT3D: Create Anything in 3D with Multi-View Diffusion Models

Authors: Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T. Barron, Ben Poole

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We trained the multi-view diffusion model at the core of CAT3D on four datasets with camera pose annotations: Objaverse [73], CO3D [74], RealEstate10K [75] and MVImgNet [76]. We then evaluated CAT3D on the few-view reconstruction task (Section 4.1) and the single image to 3D task (Section 4.2), demonstrating qualitative and quantitative improvements over prior work. The design choices that led to CAT3D are ablated and discussed further in Section 4.3.
Researcher Affiliation | Industry | Google DeepMind, Google Research
Pseudocode | No | The paper includes architectural diagrams (e.g., Figure 7) but does not provide structured pseudocode or algorithm blocks.
Open Source Code | No | We have not open sourced the code used in this work, but the datasets we used are all publicly available (Re10K, CO3D, MVImgNet, Objaverse).
Open Datasets | Yes | We trained the multi-view diffusion model at the core of CAT3D on four datasets with camera pose annotations: Objaverse [73], CO3D [74], RealEstate10K [75] and MVImgNet [76].
Dataset Splits | Yes | CO3D [74] and RealEstate10K [75] are in-distribution datasets whose training splits were part of our training set (we use their test splits for evaluation), whereas DTU [77], LLFF [78] and the mip-NeRF 360 dataset [79] are out-of-distribution datasets that were not part of the training dataset. We tested CAT3D on the 3-, 6- and 9-view reconstruction tasks, with the same train and eval splits as [7]. (A condensed sketch of this protocol appears after the table.)
Hardware Specification | Yes | Our model was trained for 16 days on 128 TPU-v4 chips. ... Our synthetic view sampling and 3D reconstruction process is run on 16 A100 GPUs.
Software Dependencies | No | The paper mentions software components and techniques like 'Flash Attention' and 'Zip-NeRF' but does not specify version numbers for any libraries or dependencies.
Experiment Setup | Yes | We fine-tune the full latent diffusion model for 1.4M iterations with a batch size of 128 and a learning rate of 5 × 10⁻⁵. ... Learning rate is logarithmically decayed from 0.04 to 10⁻³. The weight of the perceptual loss (LPIPS) is set to 0.25 for single image to 3D and few-view reconstruction on the RealEstate10K, LLFF and DTU datasets, and to 1.0 for few-view reconstruction on the CO3D and mip-NeRF 360 datasets. (A hyperparameter sketch follows the table.)
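
For quick reference, the split protocol quoted in the Dataset Splits row can be condensed into a small configuration sketch. The grouping and names below are illustrative assumptions drawn only from the quoted text, not from any released CAT3D code:

```python
# Illustrative summary of CAT3D's evaluation protocol, reconstructed from the
# quoted text above; the dataset grouping and names are our own assumptions.
IN_DISTRIBUTION = {
    # Training splits were part of the training set; test splits are used for eval.
    "CO3D": "test",
    "RealEstate10K": "test",
}

OUT_OF_DISTRIBUTION = ["DTU", "LLFF", "mip-NeRF 360"]  # never seen during training

FEW_VIEW_COUNTS = (3, 6, 9)  # few-view tasks reuse the train/eval splits of [7]
```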
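
Similarly, the Experiment Setup row quotes enough hyperparameters to sketch a minimal training configuration, including a log-linear schedule consistent with a learning rate "logarithmically decayed from 0.04 to 10⁻³". All identifiers below are hypothetical; this is a reading aid under the quoted numbers, not the authors' implementation:

```python
import math

# Hypothetical rendering of the quoted hyperparameters; identifiers are ours.
DIFFUSION_FINETUNE = {
    "iterations": 1_400_000,   # "1.4M iterations"
    "batch_size": 128,
    "learning_rate": 5e-5,     # "5 x 10^-5"
}

# Perceptual (LPIPS) loss weight during 3D reconstruction, per the quote:
# 0.25 for single image to 3D and few-view on RealEstate10K/LLFF/DTU,
# 1.0 for few-view on CO3D and mip-NeRF 360.
LPIPS_WEIGHT = {
    "single_image_to_3d": 0.25,
    "few_view/RealEstate10K": 0.25,
    "few_view/LLFF": 0.25,
    "few_view/DTU": 0.25,
    "few_view/CO3D": 1.0,
    "few_view/mip-NeRF 360": 1.0,
}

def log_decay_lr(step, total_steps, lr_init=0.04, lr_final=1e-3):
    """Log-linear decay matching "logarithmically decayed from 0.04 to 10^-3"."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return math.exp((1.0 - t) * math.log(lr_init) + t * math.log(lr_final))
```

As a sanity check, log_decay_lr(0, 1000) returns 0.04 and log_decay_lr(1000, 1000) returns 10⁻³ (up to floating-point precision), with the learning rate interpolated linearly in log space in between.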