PV3D: A 3D Generative Model for Portrait Video Generation

Authors: Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Wenqing Zhang, Song Bai, Jiashi Feng, Mike Zheng Shou

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments on various datasets including VoxCeleb (Nagrani et al., 2017), CelebV-HQ (Zhu et al., 2022) and TalkingHead-1KH (Wang et al., 2021a) well demonstrate the superiority of PV3D over previous state-of-the-art methods, both qualitatively and quantitatively.
Researcher Affiliation | Collaboration | Eric Zhongcong Xu (1), Jianfeng Zhang (2), Jun Hao Liew (2), Wenqing Zhang (2), Song Bai (2), Jiashi Feng (2), Mike Zheng Shou (1); (1) Show Lab, National University of Singapore; (2) ByteDance
Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and models are available at https://showlab.github.io/pv3d. ... Our code and models are publicly available at https://showlab.github.io/pv3d.
Open Datasets | Yes | We experiment on three face video datasets, i.e., VoxCeleb (Nagrani et al., 2017; Chung et al., 2018), CelebV-HQ (Zhu et al., 2022), and TalkingHead-1KH (Wang et al., 2021a).
Dataset Splits | No | The paper states that, for training, they sample two frames within a 16-frame span and draw timesteps from beta distributions, and that video clips are balanced for each identity. However, it does not provide explicit train/validation/test splits (e.g., percentages or counts) for the datasets themselves. (A hedged sketch of the sampling procedure follows the table.)
Hardware Specification | Yes | Our model is trained for 300k iterations with a batch size of 16, which takes 58 hours on 8 Nvidia A100 GPUs.
Software Dependencies | No | Our model is implemented using PyTorch. However, no specific version number for PyTorch or other software dependencies is provided.
Experiment Setup | Yes | For each video, we sample two frames within a 16-frame span. Following DiGAN (Yu et al., 2022), we sample timesteps {t_i, t_j} from beta distributions. The resolution of the generated video is 512×512. We use a resolution of 64 and a sampling step of 48 for neural rendering during training. In the inference stage, we use a rendering resolution of 128 for geometry visualization only. Each camera pose c has 25 dimensions, with 16 for extrinsics and 9 for intrinsics. Our model is implemented using PyTorch. We balance the loss terms by weighting factors: 1) λ_reg=0.6, λ_vid=0.65, λ_img=1.0, λ_R1=2.0 for VoxCeleb; 2) λ_reg=0.05, λ_vid=0.65, λ_img=1.0, λ_R1=4.0 for CelebV-HQ; 3) λ_reg=0.5, λ_vid=0.65, λ_img=1.0, λ_R1=2.0 for TalkingHead-1KH. Our model is trained for 300k iterations with a batch size of 16, which takes 58 hours on 8 Nvidia A100 GPUs. (A hedged configuration sketch follows the table.)
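
For readers attempting to reproduce the training-pair construction quoted above, a minimal sketch of the two-frame sampling is given below. Only the 16-frame span and the use of beta-distributed timesteps come from the paper; the Beta(2, 1) parameters, the `sample_training_pair` name, and the mapping from timesteps to frame indices are assumptions, since the excerpt does not state those details.

import numpy as np

def sample_training_pair(num_frames, span=16, beta_a=2.0, beta_b=1.0, rng=None):
    """Sample two frame indices within a `span`-frame window together with
    normalized timesteps {t_i, t_j} drawn from a beta distribution.

    `span=16` follows the paper; the beta parameters and index mapping are
    placeholders, as the excerpt above does not specify them.
    """
    rng = rng or np.random.default_rng()
    # Pick a 16-frame window inside the clip.
    start = rng.integers(0, max(num_frames - span, 0) + 1)
    # Draw two normalized timesteps in [0, 1] from a beta distribution
    # (DiGAN-style timestep sampling), then map them to frame indices.
    t_i, t_j = rng.beta(beta_a, beta_b, size=2)
    frame_i = start + int(t_i * (span - 1))
    frame_j = start + int(t_j * (span - 1))
    return (frame_i, t_i), (frame_j, t_j)

# Example: sample a training pair from a 64-frame clip.
(f_i, t_i), (f_j, t_j) = sample_training_pair(num_frames=64)
print(f"frames ({f_i}, {f_j}) with timesteps ({t_i:.2f}, {t_j:.2f})")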
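
The per-dataset loss weights and training settings quoted in the Experiment Setup row can be collected into a single configuration sketch. Only the numbers are transcribed from the quote; the dictionary layout and key names are illustrative, not the authors' actual configuration format.

# Hyperparameters transcribed from the quoted experiment setup.
# Key names are illustrative; values are copied from the paper's text.
PV3D_TRAINING_CONFIG = {
    "video_resolution": (512, 512),
    "rendering_resolution_train": 64,
    "rendering_resolution_geometry": 128,   # inference-time geometry visualization only
    "rendering_sampling_steps": 48,
    "camera_pose_dims": {"total": 25, "extrinsics": 16, "intrinsics": 9},
    "iterations": 300_000,
    "batch_size": 16,
    "loss_weights": {
        "VoxCeleb":        {"lambda_reg": 0.6,  "lambda_vid": 0.65, "lambda_img": 1.0, "lambda_R1": 2.0},
        "CelebV-HQ":       {"lambda_reg": 0.05, "lambda_vid": 0.65, "lambda_img": 1.0, "lambda_R1": 4.0},
        "TalkingHead-1KH": {"lambda_reg": 0.5,  "lambda_vid": 0.65, "lambda_img": 1.0, "lambda_R1": 2.0},
    },
}

Reported training cost for this configuration: 300k iterations at batch size 16, roughly 58 hours on 8 Nvidia A100 GPUs.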