LRM: Large Reconstruction Model for Single Image to 3D
Authors: Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, Hao Tan
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose the first Large Reconstruction Model (LRM) that predicts the 3D model of an object from a single input image within just 5 seconds... We train our model in an end-to-end manner on massive multi-view data containing around 1 million objects... Through experiments, we show that LRM can reconstruct high-fidelity 3D shapes from a wide range of images... To numerically study the design choices of our approach, we randomly acquired 50 unseen 3D shapes from the Objaverse and 50 unseen videos from the MVImgNet dataset, respectively. For each shape, we pre-process 15 reference views and pass five of them to our model one by one to reconstruct the same object, and evaluate the rendered images using all 15 reference views... We provide a quantitative comparison to the state-of-the-art methods... We evaluate the effect of data, model hyper-parameters, and training methods on the performance of LRM, measured by PSNR, CLIP-Similarity (Radford et al., 2021), SSIM (Wang et al., 2004) and LPIPS (Zhang et al., 2018) of the rendered novel views. (A hedged metric sketch follows the table.) |
| Researcher Affiliation | Collaboration | Yicong Hong¹² Kai Zhang¹ Jiuxiang Gu¹ Sai Bi¹ Yang Zhou¹ Difan Liu¹ Feng Liu¹ Kalyan Sunkavalli¹ Trung Bui¹ Hao Tan¹; ¹Adobe Research, ²Australian National University |
| Pseudocode | No | No pseudocode or algorithm blocks are explicitly labeled or formatted as such in the paper. |
| Open Source Code | No | Our LRM is built by integrating the publicly available codebases of threestudio (Guo et al., 2023), x-transformers, and DINO (Caron et al., 2021), and the model is trained using publicly available data from Objaverse (Deitke et al., 2023) and MVImgNet (Yu et al., 2023). We include very comprehensive data pre-processing, network architecture, and training details in this paper, which greatly facilitate reproducing our LRM. - While the authors say this facilitates reproduction and point to open-source components, they do not explicitly state that *their* LRM code is open-sourced or provide a link to it. The links provided are for the third-party libraries they utilized. |
| Open Datasets | Yes | LRM relies on abundant 3D data from Objaverse (Deitke et al., 2023) and MVImgNet (Yu et al., 2023), consisting of synthetic 3D assets and videos of objects in the real world, respectively, to learn a generalizable cross-shape 3D prior... and the model is trained using publicly available data from Objaverse (Deitke et al., 2023) and MVImgNet (Yu et al., 2023). |
| Dataset Splits | No | The paper describes training on large datasets (Objaverse, MVImgNet) and evaluating on unseen data from these and other sources (e.g., Google Scanned Objects). While unseen data is used for evaluation, it does not explicitly provide percentages or counts for distinct training, validation, and test splits from a single dataset, nor does it specify how a 'validation set' was formally partitioned from the training data. |
| Hardware Specification | Yes | We train LRM on 128 NVIDIA (40G) A100 GPUs with batch size 1024 (1024 different shapes per iteration) for 30 epochs, taking about 3 days to complete... This entire process only takes less than 5 seconds to complete on a single NVIDIA A100 GPU. |
| Software Dependencies | No | We take the default Layer Norm (LN) implementation in PyTorch (Paszke et al., 2019)... Our LRM is built by integrating the publicly available codebases of threestudio (Guo et al., 2023), x-transformers, and DINO (Caron et al., 2021)... The paper mentions software components and cites papers for them, but does not provide explicit version numbers (e.g., "PyTorch 1.9" or "DINO vX.Y") for these dependencies. |
| Experiment Setup | Yes | We train LRM on 128 NVIDIA (40G) A100 GPUs with batch size 1024 (1024 different shapes per iteration) for 30 epochs, taking about 3 days to complete. Each epoch contains one copy of the rendered image data from Objaverse and three copies of the video frame data from MVImgNet to balance the amount of synthetic and real data. For each sample, we use 3 randomly chosen side views (i.e., the total views V = 4) to supervise the shape reconstruction, and we set the coefficient λ = 2.0 for L_LPIPS. We apply the AdamW optimizer (Loshchilov & Hutter, 2017) and set the learning rate to 4 × 10⁻⁴ with a cosine schedule (Loshchilov & Hutter, 2016). We numerically analyze the influence of data, training, and model hyper-parameters in the Appendix. (A hedged optimizer/loss sketch follows the table.) |
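
The Experiment Setup row reads as a training configuration, so here is a minimal PyTorch sketch of the reported optimizer, learning-rate schedule, and loss weighting. The stand-in model, the step count `T_max`, the helper `reconstruction_loss`, and the MSE image term are assumptions for illustration; the quoted text only states AdamW, a learning rate of 4 × 10⁻⁴ with a cosine schedule, and λ = 2.0 on the LPIPS term.

```python
import torch

# Stand-in module only; the actual LRM transformer / triplane NeRF is not reproduced here.
model = torch.nn.Linear(768, 768)

# AdamW with lr = 4e-4 and a cosine decay, as reported in the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)
# T_max is illustrative: the paper trains 30 epochs at batch size 1024 over ~1M
# objects, so the real number of optimization steps differs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30_000)

LAMBDA_LPIPS = 2.0  # coefficient on the LPIPS term reported in the paper


def reconstruction_loss(pred_views, gt_views, lpips_fn):
    """Loss over the V = 4 supervising views: pixel MSE (assumed) + 2.0 * LPIPS."""
    l_mse = torch.mean((pred_views - gt_views) ** 2)
    # lpips_fn expects images scaled to [-1, 1].
    l_lpips = lpips_fn(pred_views * 2 - 1, gt_views * 2 - 1).mean()
    return l_mse + LAMBDA_LPIPS * l_lpips
```

In a training loop, `scheduler.step()` would be called after each optimizer step so the learning rate decays along a cosine curve from 4 × 10⁻⁴ toward zero.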
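
The Research Type row lists PSNR, CLIP-Similarity, SSIM, and LPIPS as the metrics reported on rendered novel views. Below is a minimal sketch of two of them, assuming images in [0, 1] and the `lpips` PyPI package with a VGG backbone; the backbone choice and the `psnr` helper are assumptions, since the paper only cites Zhang et al. (2018) for LPIPS.

```python
import torch
import lpips  # assumed dependency: the reference LPIPS implementation (pip install lpips)


def psnr(pred: torch.Tensor, gt: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """PSNR between a rendered novel view and a reference view, both in [0, 1]."""
    mse = torch.mean((pred - gt) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)


# Backbone choice is an assumption; 'alex' is the package default.
lpips_fn = lpips.LPIPS(net="vgg")

pred = torch.rand(1, 3, 256, 256)  # rendered novel view (dummy data)
gt = torch.rand(1, 3, 256, 256)    # reference view (dummy data)

print("PSNR :", psnr(pred, gt).item())
print("LPIPS:", lpips_fn(pred * 2 - 1, gt * 2 - 1).item())  # LPIPS inputs live in [-1, 1]
```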