Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

Authors: Suzhen Wang, Lincheng Li, Yu Ding, Changjie Fan, Xin Yu

IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our method produces videos with plausible head motions, synchronized facial expressions, and stable backgrounds and outperforms the state-of-the-art.
Researcher Affiliation | Collaboration | Suzhen Wang (1), Lincheng Li (1), Yu Ding (1), Changjie Fan (1), Xin Yu (2); (1) Virtual Human Group, NetEase Fuxi AI Lab, China; (2) University of Technology Sydney
Pseudocode | No | The paper contains mathematical formulations and architecture diagrams, but no explicit pseudocode or algorithm blocks.
Open Source Code | No | To ensure proper use, we will release our code and models to promote the progress in detecting fake videos.
Open Datasets | Yes | We use prevalent benchmark datasets VoxCeleb [Nagrani et al., 2017], GRID [Cooke et al., 2006] and LRW [Chung and Zisserman, 2016] to evaluate the proposed method.
Dataset Splits | Yes | We split each dataset into training and testing sets following the setting of previous works.
Hardware Specification | Yes | NH is trained on VoxCeleb for one day on one RTX 2080 Ti with batchsize 64. ... The training of ND and NI takes 3 days with batchsize 28, and that of NM takes one week with batchsize 4 on 4 RTX 2080 Ti.
Software Dependencies | No | All our networks are implemented using PyTorch.
Experiment Setup | Yes | We adopt Adam optimizer during training, with an initial learning rate of 2e-4 and weight decay to 2e-6. ... NH is trained on VoxCeleb for one day on one RTX 2080 Ti with batchsize 64. ... The training of ND and NI takes 3 days with batchsize 28, and that of NM takes one week with batchsize 4 on 4 RTX 2080 Ti.
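The Experiment Setup row reports only high-level hyperparameters (Adam, initial learning rate 2e-4, "weight decay to 2e-6", and per-network batch sizes). Below is a minimal PyTorch sketch of that configuration under stated assumptions: the HeadMotionPredictor module, its layer sizes, and the feature dimensions are placeholders invented for illustration (the code has not been released), and "weight decay to 2e-6" is read here as Adam's weight_decay parameter, although the quote could also mean the learning rate is decayed to 2e-6.

import torch
import torch.nn as nn

# Placeholder stand-in for the head-motion predictor (NH in the paper);
# the architecture and sizes below are illustrative assumptions, not taken from the paper.
class HeadMotionPredictor(nn.Module):
    def __init__(self, audio_feat_dim=41, pose_dim=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, pose_dim),
        )

    def forward(self, audio_features):
        return self.net(audio_features)

model = HeadMotionPredictor()

# Optimizer settings quoted in the Experiment Setup row:
# Adam, initial learning rate 2e-4; weight_decay=2e-6 is one possible reading of the quote.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=2e-6)

# Batch size 64 is the value reported for training NH on VoxCeleb.
batch_size = 64
audio_batch = torch.randn(batch_size, 41)   # dummy audio features
target_poses = torch.randn(batch_size, 6)   # dummy head-pose targets

# One illustrative optimization step with a generic MSE loss (the paper's actual losses differ).
loss = nn.functional.mse_loss(model(audio_batch), target_poses)
optimizer.zero_grad()
loss.backward()
optimizer.step()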