HumanSplat: Generalizable Single-Image Human Gaussian Splatting with Structure Priors

Authors: Panwang Pan, Zhuo Su, Chenguo Lin, Zhen Fan, Yongjie Zhang, Zeming Li, Tingting Shen, Yadong Mu, Yebin Liu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments on standard benchmarks and in-the-wild images demonstrate that HumanSplat surpasses existing state-of-the-art methods in photorealistic novel-view synthesis.
Researcher Affiliation | Collaboration | Panwang Pan (1), Zhuo Su (1), Chenguo Lin (1,2), Zhen Fan (1), Yongjie Zhang (1), Zeming Li (1), Tingting Shen (3), Yadong Mu (2), Yebin Liu (4); affiliations: (1) ByteDance, (2) Peking University, (3) Xiamen University, (4) Tsinghua University.
Pseudocode | No | The paper describes the proposed method but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Project page: https://humansplat.github.io
Open Datasets | Yes | HumanSplat is trained on 500 THuman2.0 [105], 1500 2K2K [109], and 1500 Twindom [106] high-fidelity human scans.
Dataset Splits | No | The paper mentions evaluating on specific scans from THuman2.0 and Twindom, and refers to a '2K2K validation set' in Figure 7(b), but it does not specify the percentages, sample counts, or methodology used to create reproducible training/validation/test splits.
Hardware Specification | Yes | We conduct 200 epochs of 256-res training with a learning rate of 1e-5 and a batch size of 32 over 2 days on 8 A100 (40G VRAM) GPUs, while 512-res finetuning costs 2 additional days.
Software Dependencies | No | The paper mentions using Flash-Attention-v2 and the xFormers library but does not provide version numbers for these dependencies, which are needed for a reproducible setup.
Experiment Setup | Yes | We evenly position 36 cameras across each of three hierarchical cycles to capture the full body, half body, and face, with the rendering resolution set to 512×512. We conduct 200 epochs of 256-res training with a learning rate of 1e-5 and a batch size of 32 over 2 days on 8 A100 (40G VRAM) GPUs, while 512-res finetuning costs 2 additional days. We train our model with the AdamW [110] optimizer, whose β1 and β2 are set to 0.9 and 0.95, respectively. A weight decay of 0.05 is used on all parameters except those of the LayerNorm layers. We use a cosine learning-rate decay scheduler with a 2000-step linear warm-up, and the peak learning rate is set to 4e-4. For parts related to the head, hands, and arms, λj is set to 2, while it is set to 1 for the remaining body parts. The parameters λi, λp, and λm are set to 1. The model is trained for 80K iterations at 256-res and then fine-tuned at 512-res for another 20K iterations.
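
The optimizer and schedule quoted above translate directly into code. Below is a minimal PyTorch sketch, assuming only the hyper-parameters reported in the paper (AdamW with β1 = 0.9, β2 = 0.95; weight decay 0.05 on everything except LayerNorm parameters; a 2000-step linear warm-up into a cosine decay with a 4e-4 peak learning rate). Function and variable names here are illustrative and are not taken from the authors' code.

```python
import math
import torch
from torch import nn

def build_optimizer_and_scheduler(model: nn.Module, total_steps: int,
                                  peak_lr: float = 4e-4,
                                  warmup_steps: int = 2000,
                                  weight_decay: float = 0.05):
    """Sketch of the reported training recipe; not the authors' implementation."""
    decay, no_decay = [], []
    for module in model.modules():
        for _, param in module.named_parameters(recurse=False):
            if not param.requires_grad:
                continue
            # Paper: weight decay of 0.05 on all parameters except LayerNorm layers.
            (no_decay if isinstance(module, nn.LayerNorm) else decay).append(param)

    optimizer = torch.optim.AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=peak_lr, betas=(0.9, 0.95))

    def lr_multiplier(step: int) -> float:
        if step < warmup_steps:            # 2000-step linear warm-up
            return step / max(1, warmup_steps)
        t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(t, 1.0)))  # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_multiplier)
    return optimizer, scheduler

# Loss weights reported in the paper: λj = 2 for head/hands/arms parts, 1 otherwise;
# λi, λp, and λm are all 1 (the corresponding loss terms are defined in the paper).
# model = ...  # the HumanSplat network
# opt, sched = build_optimizer_and_scheduler(model, total_steps=80_000)
```

A single LambdaLR multiplier keeps the warm-up and decay in one place; the 80K-iteration 256-res stage and the 20K-iteration 512-res fine-tuning stage would each call this with their own total_steps.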