Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion
Authors: Suzhen Wang, Lincheng Li, Yu Ding, Changjie Fan, Xin Yu
IJCAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our method produces videos with plausible head motions, synchronized facial expressions, and stable backgrounds and outperforms the state-of-the-art. |
| Researcher Affiliation | Collaboration | Suzhen Wang¹, Lincheng Li¹, Yu Ding¹, Changjie Fan¹, Xin Yu² (¹Virtual Human Group, Netease Fuxi AI Lab, China; ²University of Technology Sydney) |
| Pseudocode | No | The paper contains mathematical formulations and architecture diagrams, but no explicit pseudocode or algorithm blocks. |
| Open Source Code | No | To ensure proper use, we will release our code and models to promote the progress in detecting fake videos. |
| Open Datasets | Yes | We use prevalent benchmark datasets VoxCeleb [Nagrani et al., 2017], GRID [Cooke et al., 2006] and LRW [Chung and Zisserman, 2016] to evaluate the proposed method. |
| Dataset Splits | Yes | We split each dataset into training and testing sets following the setting of previous works. |
| Hardware Specification | Yes | NH is trained on VoxCeleb for one day on one RTX 2080 Ti with batchsize 64. ... The training of ND and NI takes 3 days with batchsize 28, and that of NM takes one week with batchsize 4 on 4 RTX 2080 Ti. |
| Software Dependencies | No | All our networks are implemented using PyTorch. |
| Experiment Setup | Yes | We adopt Adam optimizer during training, with an initial learning rate of 2e-4 and weight decay to 2e-6. ... NH is trained on VoxCeleb for one day on one RTX 2080 Ti with batchsize 64. ... The training of ND and NI takes 3 days with batchsize 28, and that of NM takes one week with batchsize 4 on 4 RTX 2080 Ti. |
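
For readers attempting to reproduce the reported setup, the sketch below shows how the quoted optimizer settings could be wired up in PyTorch. This is not the authors' code: the model, dataset, and loss are placeholders standing in for the paper's networks (NH, ND, NI, NM); only the learning rate (2e-4), weight decay (2e-6), and batch sizes (64 / 28 / 4) come from the quoted excerpt.

```python
# Hypothetical reproduction sketch of the quoted training configuration.
# The network, data, and loss are placeholders; only lr, weight_decay,
# and batch_size values are taken from the paper's reported setup.
import torch
from torch import nn
from torch.optim import Adam
from torch.utils.data import DataLoader, TensorDataset

# Placeholder network standing in for NH / ND / NI / NM.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))

# Adam with the reported initial learning rate (2e-4) and weight decay (2e-6).
optimizer = Adam(model.parameters(), lr=2e-4, weight_decay=2e-6)

# Reported batch sizes: 64 for NH, 28 for ND/NI, 4 for NM.
batch_size = 64
dummy_data = TensorDataset(torch.randn(256, 512), torch.randn(256, 128))
loader = DataLoader(dummy_data, batch_size=batch_size, shuffle=True)

for inputs, targets in loader:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)  # placeholder loss
    loss.backward()
    optimizer.step()
```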