GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis

Authors: Zhenhui Ye, Ziyue Jiang, Yi Ren, Jinglin Liu, Jinzheng He, Zhou Zhao

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that our method achieves more generalized and high-fidelity talking face generation compared to previous methods.
Researcher Affiliation | Collaboration | Zhenhui Ye1, Ziyue Jiang1, Yi Ren2, Jinglin Liu1, Jinzheng He1, Zhou Zhao1; 1School of Computer Science and Technology, Zhejiang University {zhenhuiye,jiangziyue,jinglinliu,jinzhenghe,zhaozhou}@zju.edu.cn; 2ByteDance ren.yi@bytedance.com
Pseudocode | No | The paper includes architectural diagrams and descriptions of the model components but does not provide any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Video samples and source code are available at https://geneface.github.io
Open Datasets | Yes | To learn robust audio-to-motion mapping, a large-scale lipreading corpus is needed. Hence we use LRS3-TED (Afouras et al., 2018) to train our variational generator and post-net. Additionally, a certain person's speaking video of a few minutes in length with an audio track is needed to learn a NeRF-based person portrait renderer. To be specific, in order to compare with the state-of-the-art methods, we utilize the datasets of Lu et al. (2021) and Guo et al. (2021), which consist of 5 videos with an average length of 6,000 frames at 25 fps.
Dataset Splits | No | The paper mentions using LRS3-TED and other datasets for training and testing, and uses the term 'test' in its evaluation. However, it does not provide explicit training, validation, and test split details (e.g., percentages or exact counts) for the datasets used, so the splits cannot be reproduced.
Hardware Specification | Yes | We train the GeneFace on 1 NVIDIA RTX 3090 GPU, and the detailed training hyper-parameters of the variational generator, post-net, and NeRF-based renderer are listed in Appendix B.
Software Dependencies | No | The paper lists model configurations and hyperparameters in Appendix B.1, but it does not specify any software dependencies (e.g., libraries, frameworks, or operating systems) with version numbers.
Experiment Setup | Yes | Implementation Details. We train the GeneFace on 1 NVIDIA RTX 3090 GPU, and the detailed training hyper-parameters of the variational generator, post-net, and NeRF-based renderer are listed in Appendix B. For the variational generator and post-net, it takes about 40k and 12k steps to converge (about 12 hours). For the NeRF-based renderer, we train each model for 800k iterations (400k for the head and 400k for the torso, respectively), which takes about 72 hours. Table 4: Hyper-parameter list
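For reference, the quoted figures imply roughly 4 minutes of footage per person-specific video (6,000 frames at 25 fps is 240 seconds) and a two-phase training budget. The snippet below is a minimal sketch in plain Python that only encodes the step counts, wall-clock times, and GPU quoted above; the Stage dataclass and the stage names (variational_generator, post_net, nerf_head, nerf_torso) are illustrative assumptions, not identifiers from the official GeneFace code.

```python
# Minimal sketch of the training budget quoted in the "Experiment Setup" row.
# Stage names and the Stage dataclass are hypothetical; only the numbers
# (steps, iterations, rough wall-clock hours, 1x RTX 3090) come from the paper.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    steps: int  # reported training steps / iterations


# Audio-to-motion phase: ~40k steps (variational generator) + ~12k steps
# (post-net), converging in about 12 hours on a single NVIDIA RTX 3090.
AUDIO_TO_MOTION = [Stage("variational_generator", 40_000), Stage("post_net", 12_000)]

# Rendering phase: 800k NeRF iterations total, split 400k head / 400k torso,
# taking about 72 hours on the same GPU.
RENDERER = [Stage("nerf_head", 400_000), Stage("nerf_torso", 400_000)]

if __name__ == "__main__":
    phases = [("audio-to-motion", AUDIO_TO_MOTION, 12), ("NeRF renderer", RENDERER, 72)]
    for label, stages, hours in phases:
        total = sum(s.steps for s in stages)
        print(f"{label}: {total:,} steps total, ~{hours} h on 1x NVIDIA RTX 3090")
```

Running the script simply prints the two phase totals (52,000 and 800,000 steps), matching the roughly 12-hour and 72-hour training times reported in the excerpt.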