PV3D: A 3D Generative Model for Portrait Video Generation
Authors: Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Wenqing Zhang, Song Bai, Jiashi Feng, Mike Zheng Shou
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments on various datasets, including VoxCeleb (Nagrani et al., 2017), CelebV-HQ (Zhu et al., 2022) and TalkingHead-1KH (Wang et al., 2021a), well demonstrate the superiority of PV3D over previous state-of-the-art methods, both qualitatively and quantitatively. |
| Researcher Affiliation | Collaboration | Eric Zhongcong Xu1, Jianfeng Zhang2, Jun Hao Liew2, Wenqing Zhang2, Song Bai2, Jiashi Feng2, Mike Zheng Shou1. 1 Show Lab, National University of Singapore; 2 ByteDance |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and models are available at https://showlab.github.io/pv3d. ... Our code and models are publicly available at https://showlab.github.io/pv3d. |
| Open Datasets | Yes | We experiment on three face video datasets, i.e., VoxCeleb (Nagrani et al., 2017; Chung et al., 2018), CelebV-HQ (Zhu et al., 2022), and TalkingHead-1KH (Wang et al., 2021a). |
| Dataset Splits | No | The paper states that for training, they sample two frames within a 16-frame span and use beta distributions for timesteps. It also mentions balancing video clips for each identity. However, it does not provide explicit train/validation/test splits (e.g., percentages or counts) for the datasets themselves. |
| Hardware Specification | Yes | Our model is trained for 300k iterations with a batch size of 16, which takes 58 hours on 8 Nvidia A100 GPUs. |
| Software Dependencies | No | Our model is implemented using PyTorch. However, no specific version of PyTorch or of any other software dependency is provided. |
| Experiment Setup | Yes | For each video, we sample two frames within a 16-frame span. Following DIGAN (Yu et al., 2022), we sample timesteps {ti, tj} from beta distributions. The resolution of the generated video is 512×512. We use a rendering resolution of 64 and 48 sampling steps for neural rendering during training. In the inference stage, we use a rendering resolution of 128 for geometry visualization only. Each camera pose c has 25 dimensions, with 16 for extrinsics and 9 for intrinsics. Our model is implemented using PyTorch. We balance the loss terms with weighting factors: 1) λreg=0.6, λvid=0.65, λimg=1.0, λR1=2.0 for VoxCeleb; 2) λreg=0.05, λvid=0.65, λimg=1.0, λR1=4.0 for CelebV-HQ; 3) λreg=0.5, λvid=0.65, λimg=1.0, λR1=2.0 for TalkingHead-1KH. Our model is trained for 300k iterations with a batch size of 16, which takes 58 hours on 8 Nvidia A100 GPUs. (See the illustrative sketches after this table.) |
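
As a rough illustration of the frame sampling described in the experiment setup, the sketch below draws two timesteps within a 16-frame span from a beta distribution. The `sample_timesteps` helper and the concrete (alpha, beta) parameters are assumptions made for illustration; the paper only states that the timesteps follow beta distributions as in DIGAN (Yu et al., 2022) and does not report the exact distribution parameters.

```python
import torch

def sample_timesteps(span: int = 16, alpha: float = 2.0, beta: float = 1.0):
    """Draw two timesteps {t_i, t_j} within a `span`-frame window.

    The (alpha, beta) values are illustrative placeholders; the paper
    only says the timesteps are sampled from beta distributions.
    """
    dist = torch.distributions.Beta(alpha, beta)
    # Sample two values in [0, 1] and scale them to the frame span.
    t = dist.sample((2,)) * (span - 1)
    t_i, t_j = torch.sort(t).values
    return t_i.item(), t_j.item()

# Example: pick the two training frames for one video clip.
t_i, t_j = sample_timesteps()
```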
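Similarly, a minimal sketch of the 25-dimensional camera pose c mentioned in the setup (16 extrinsic plus 9 intrinsic entries). The `build_camera_condition` helper and the row-major flattening order are assumptions, not the authors' published code.

```python
import numpy as np

def build_camera_condition(extrinsic: np.ndarray, intrinsic: np.ndarray) -> np.ndarray:
    """Flatten a 4x4 extrinsic matrix and a 3x3 intrinsic matrix into the
    25-dimensional camera pose vector c (16 extrinsic + 9 intrinsic entries).
    The flattening order is an assumption; the paper does not specify it."""
    assert extrinsic.shape == (4, 4) and intrinsic.shape == (3, 3)
    return np.concatenate([extrinsic.reshape(-1), intrinsic.reshape(-1)])  # shape (25,)
```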
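Finally, the dataset-specific loss weights reported above can be combined as a weighted sum, sketched below. The weight values come from the experiment setup; the individual loss tensors and the `total_loss` helper are placeholders for the regularization, video, image, and R1 terms named in the paper.

```python
# Per-dataset weights (lambda_reg, lambda_vid, lambda_img, lambda_R1)
# as reported in the experiment setup.
LOSS_WEIGHTS = {
    "VoxCeleb":        dict(reg=0.6,  vid=0.65, img=1.0, r1=2.0),
    "CelebV-HQ":       dict(reg=0.05, vid=0.65, img=1.0, r1=4.0),
    "TalkingHead-1KH": dict(reg=0.5,  vid=0.65, img=1.0, r1=2.0),
}

def total_loss(losses: dict, dataset: str = "VoxCeleb"):
    """Weighted sum of the individual loss terms for a given dataset."""
    w = LOSS_WEIGHTS[dataset]
    return (w["reg"] * losses["reg"] + w["vid"] * losses["vid"]
            + w["img"] * losses["img"] + w["r1"] * losses["r1"])
```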