Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention

Authors: Peng Li, Yuan Liu, Xiaoxiao Long, Feihu Zhang, Cheng Lin, Mengfei Li, Xingqun Qi, Shanghang Zhang, Wei Xue, Wenhan Luo, Ping Tan, Wenping Wang, Qifeng Liu, Yike Guo

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments demonstrate the superior generation power of Era3D... We trained Era3D on a subset of Objaverse [10]... Our methodology is evaluated in two tasks, novel view synthesis (NVS) and 3D reconstruction. The NVS quality is evaluated by the Learned Perceptual Image Patch Similarity (LPIPS) [77]... The 3D reconstruction quality is evaluated by the Chamfer Distance (CD) and the Volume IoU... Quantitative comparisons of Chamfer Distance (CD) and Intersection over Union (IoU) are shown in Tab. 1.
Researcher Affiliation | Collaboration | 1HKUST 2HKU 3Dream Tech 4PKU 5Light Illusion
Pseudocode | No | The paper describes methods in prose and equations, but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | Project page: https://penghtyx.github.io/Era3D/. Also, in the NeurIPS Paper Checklist, Q5: 'Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?' Answer: [No] Justification: 'Our evaluation uses the public datasets. We do not provide code in Supplementary Material. But they will be made publicly available once they have been fully prepared.'
Open Datasets | Yes | Datasets. We trained Era3D on a subset of Objaverse [10]... Following the previous methodologies [32, 33], we evaluate the performance of Era3D on the Google Scanned Object [12] dataset, widely regarded as a standard benchmark for 3D generation tasks.
Dataset Splits | No | The paper states training details such as batch size and step count, but does not specify exact percentages or counts for training, validation, or test splits.
Hardware Specification | Yes | We train Era3D on 16 H800 GPUs (each with 80 GB) using a batch size of 128 for 40,000 steps.
Software Dependencies | No | Our implementation is built upon the open-source text-to-image model, SD2.1-unclip [51]... Even with Xformers [26], an accelerating library for attention, the efficiency of row-wise attention still outperforms existing methods by approximately twelve-fold, as evident in Tab. 3. The paper does not provide specific version numbers for these software components.
Experiment Setup | Yes | Implementation details. Our implementation is built upon the open-source text-to-image model, SD2.1-unclip [51]. We train Era3D on 16 H800 GPUs (each with 80 GB) using a batch size of 128 for 40,000 steps. We set the initial learning rate as 1e-4 and decreased it to 5e-5 after 5,000 steps. The training process takes approximately 30 hours. To conduct classifier-free guidance (CFG) [18], we randomly omit the clip condition at a rate of 0.05. During inference, we employ the DDIM sampler [57] with 40 steps and a CFG scale of 3.0 for the generation.
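The Chamfer Distance and Volume IoU metrics cited in the evaluation can be sketched with their common definitions. Note that the paper's exact protocol (point-sampling density, squared vs. unsquared distances, normalization) is not specified here, so this is an illustrative sketch rather than the authors' evaluation code:

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer Distance between point sets p (N,3) and q (M,3).

    One common formulation: mean nearest-neighbor distance in both
    directions, summed. Papers vary in whether distances are squared
    and how the two directions are combined.
    """
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def volume_iou(a, b):
    """Intersection-over-Union of two boolean occupancy grids of equal shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union
```

With identical inputs, `chamfer_distance` returns 0 and `volume_iou` returns 1, which is a quick sanity check when wiring up an evaluation pipeline.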
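The hyperparameters quoted in the experiment-setup row reduce to a simple step learning-rate schedule plus the standard classifier-free-guidance combination at inference. A minimal sketch (the helper names are ours, not from the paper, and the CFG formula is the generic Ho and Salimans formulation rather than Era3D's code):

```python
def learning_rate(step, base=1e-4, decayed=5e-5, decay_step=5_000):
    """Step schedule described in the paper: 1e-4 initially,
    dropped to 5e-5 after 5,000 steps (out of 40,000 total)."""
    return base if step < decay_step else decayed

def cfg_noise(eps_uncond, eps_cond, scale=3.0):
    """Classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the conditional one. The paper reports
    a CFG scale of 3.0 with a 40-step DDIM sampler."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

The 0.05 condition-dropping rate mentioned for training is what makes the unconditional branch of `cfg_noise` available at inference time.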