MVGamba: Unify 3D Content Generation as State Space Sequence Modeling
Authors: Xuanyu Yi, Zike Wu, Qiuhong Shen, Qingshan Xu, Pan Zhou, Joo-Hwee Lim, Shuicheng Yan, Xinchao Wang, Hanwang Zhang
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that MVGamba outperforms state-of-the-art baselines in all 3D content generation scenarios with approximately only 0.1× of the model size. We conducted comprehensive qualitative and quantitative experiments to verify the efficacy of our proposed MVGamba. |
| Researcher Affiliation | Collaboration | ¹Nanyang Technological University, ²National University of Singapore, ³University of British Columbia, ⁴Singapore Management University, ⁵Institute for Infocomm Research, ⁶Skywork AI |
| Pseudocode | Yes | Figure 14: The pseudocode for the Gaussian parameter constraint. We provide the detailed pseudocode of the Gaussian parameterization in Figure 14 for better reproducibility. (A hedged parameterization sketch appears after this table.) |
| Open Source Code | Yes | The code is available at https://github.com/SkyworkAI/MVGamba. |
| Open Datasets | Yes | Training dataset. We obtain the multi-view images from Objaverse [7] for MVGamba pre-training. We use the well-adopted PSNR, SSIM, and LPIPS metrics for quantitative measurement on the GSO [76] dataset, following [18]. (A PSNR sketch appears after this table.) |
| Dataset Splits | No | The paper specifies training with 'input views' and 'supervision' views ('another random set of 6 views as supervision'), and then evaluates on 'test views' and datasets like GSO. However, it does not explicitly define a separate 'validation' dataset split by percentage or sample count, which is typically used for hyperparameter tuning or early stopping. |
| Hardware Specification | Yes | MVGamba is trained on 32 NVIDIA A100 (80G) GPUs with batch size 512 for about 2 days. This process...completes in less than 5 seconds (4.5 seconds for multi-view image generation and 0.03 seconds for predicting Gaussians and real-time rendering) on a single NVIDIA A800 (80G) GPU, making it well-suited for online deployment scenarios. |
| Software Dependencies | No | The paper mentions software components such as the AdamW optimizer, mixed-precision training with the BF16 data type, and the Open3D library for TSDF fusion, but it does not provide specific version numbers for programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other key software dependencies. |
| Experiment Setup | Yes | MVGamba is trained on 32 NVIDIA A100 (80G) GPUs with batch size 512 for about 2 days. We adopt gradient checkpointing and mixed-precision training with the BF16 data type to ensure efficient training and inference. We use the AdamW optimizer with learning rate 1×10⁻³ and weight decay 0.05, following a linear learning-rate warm-up for 15 epochs with cosine decay to 1×10⁻⁵. The output Gaussians are rendered at 512×512 resolution for the mean-squared-error loss and resized to 256×256 for the LPIPS loss for memory efficiency. The trade-off coefficients balancing each loss were set as λmask = 1.0, λLPIPS = 0.6, and λreg = 0.001. We also follow the common practice [19] of clipping the gradient with a maximum norm of 1.0. Details of the MVGamba model configuration are included in Appendix D (Table 5). (A hedged configuration sketch follows this table.) |
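
The Gaussian parameter constraint itself (Figure 14 of the paper) is not reproduced in this report. Below is a minimal PyTorch sketch of a typical constraint head for Gaussian-splatting decoders: the 14-channel split, the activation choices, and the clamping ranges are common conventions assumed for illustration, not confirmed details of MVGamba's Figure 14.

```python
import torch
import torch.nn.functional as F

def constrain_gaussian_params(raw: torch.Tensor) -> torch.Tensor:
    """Map unconstrained network outputs to valid 3D Gaussian parameters.

    A hedged sketch, not the paper's exact pseudocode. Assumed layout:
    raw: (N, 14) = 3 (position) + 1 (opacity) + 3 (scale)
                 + 4 (rotation quaternion) + 3 (RGB color).
    """
    pos, opacity, scale, rot, rgb = raw.split([3, 1, 3, 4, 3], dim=-1)

    pos = torch.tanh(pos)                # keep centers in a bounded cube
    opacity = torch.sigmoid(opacity)     # opacity in (0, 1)
    scale = 0.1 * torch.sigmoid(scale)   # small positive scales for stability
    rot = F.normalize(rot, dim=-1)       # unit quaternion
    rgb = torch.sigmoid(rgb)             # colors in (0, 1)

    return torch.cat([pos, opacity, scale, rot, rgb], dim=-1)
```

For example, `constrain_gaussian_params(torch.randn(1024, 14))` returns a (1024, 14) tensor of valid Gaussian parameters ready for rasterization.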
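Of the three reported metrics, PSNR has a simple closed form, PSNR = 10·log₁₀(MAX² / MSE); the snippet below computes it for images in [0, 1]. SSIM and LPIPS rely on reference implementations (e.g., the `lpips` package) and are not sketched here.

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val**2 / mse)
```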
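The quoted experiment setup is concrete enough to sketch as a training configuration. The snippet below reconstructs the AdamW optimizer (lr 1×10⁻³, weight decay 0.05), the 15-epoch linear warm-up with cosine decay to 1×10⁻⁵, BF16 autocast, and gradient clipping at norm 1.0. The total epoch count, the stand-in model, and the dummy loss are assumptions; the paper's actual loss combines MSE, mask (λ = 1.0), LPIPS (λ = 0.6), and regularization (λ = 0.001) terms.

```python
import math
import torch

# Hyperparameters quoted from the paper's experiment setup.
LR, LR_MIN, WEIGHT_DECAY = 1e-3, 1e-5, 0.05
WARMUP_EPOCHS, TOTAL_EPOCHS = 15, 100  # total epoch count is an assumption
MAX_GRAD_NORM = 1.0

model = torch.nn.Linear(16, 16)  # stand-in for the MVGamba reconstructor
optimizer = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)

def lr_at(epoch: int) -> float:
    """Linear warm-up for 15 epochs, then cosine decay from 1e-3 to 1e-5."""
    if epoch < WARMUP_EPOCHS:
        return LR * (epoch + 1) / WARMUP_EPOCHS
    t = (epoch - WARMUP_EPOCHS) / max(1, TOTAL_EPOCHS - WARMUP_EPOCHS)
    return LR_MIN + 0.5 * (LR - LR_MIN) * (1.0 + math.cos(math.pi * t))

for epoch in range(TOTAL_EPOCHS):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(epoch)
    x = torch.randn(4, 16)  # dummy batch standing in for multi-view renders
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):  # BF16 (paper: on GPU)
        loss = torch.nn.functional.mse_loss(model(x), x)  # stand-in for the full loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
    optimizer.step()
```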