Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

Authors: Lincheng Li, Suzhen Wang, Zhimeng Zhang, Yu Ding, Yixing Zheng, Xin Yu, Changjie Fan

AAAI 2021, pp. 1911-1920

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on qualitative and quantitative results demonstrate that our algorithm achieves high-quality photorealistic talking-head videos, including various facial expressions and head motions according to speech rhythms, and outperforms the state-of-the-art.
Researcher Affiliation | Collaboration | 1 Netease Fuxi AI Lab; 2 University of Technology Sydney; {lilincheng, wangsuzhen, zhangzhimeng, dingyu01, zhengyixing01, fanchangjie}@corp.netease.com; xin.yu@uts.edu.au
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Both datasets are released for research purposes (https://github.com/FuxiVirtualHuman/Write-a-Speaker).
Open Datasets | Yes | Both datasets are released for research purposes (https://github.com/FuxiVirtualHuman/Write-a-Speaker).
Dataset Splits | No | The paper mentions using 'groundtruth test data' from the Mocap dataset but does not specify training, validation, or test split percentages or sample counts.
Hardware Specification | Yes | We implement the system using PyTorch on a single GTX 2080Ti.
Software Dependencies | No | The paper mentions PyTorch and the Adam optimizer but does not specify version numbers for any software dependencies.
Experiment Setup | Yes | The loss weights are set to λ_mou = 50, λ_upp = 100, α = 10, β = 100, and γ = 100. We use the Adam (Kingma and Ba 2014) optimizer for all networks. For training G_mou, G_upp, and G_hed, we set β1 = 0.5, β2 = 0.99, ϵ = 10⁻⁸, a batch size of 32, and initial learning rates of 0.0005 for the generators and 0.00001 for the discriminators. The learning rates of G_mou stay fixed for the first 400 epochs and linearly decay to zero over another 400 epochs. The learning rates of G_upp and G_hed stay fixed for the first 50 epochs and linearly decay to zero over another 50 epochs. For training G_vid, we set β1 = 0.5, β2 = 0.999, ϵ = 10⁻⁸, a batch size of 3, and an initial learning rate of 0.0002 with linear decay to 0.0001 over 50 epochs.
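To make the quoted training schedule concrete, here is a minimal PyTorch sketch of the Adam configuration and the fixed-then-linearly-decaying learning-rate rule for G_mou. The generator and discriminator modules are hypothetical stand-ins (the paper's actual architectures are not reproduced in this report); only the optimizer hyperparameters and the shape of the schedule come from the quoted setup.

```python
import torch

# Hypothetical placeholder modules standing in for G_mou and its
# discriminator; simple stubs used purely for illustration.
generator = torch.nn.Linear(64, 64)
discriminator = torch.nn.Linear(64, 1)

# Adam settings quoted for G_mou: β1 = 0.5, β2 = 0.99, ϵ = 10⁻⁸,
# lr 0.0005 for the generator and 0.00001 for the discriminator.
opt_g = torch.optim.Adam(generator.parameters(),
                         lr=5e-4, betas=(0.5, 0.99), eps=1e-8)
opt_d = torch.optim.Adam(discriminator.parameters(),
                         lr=1e-5, betas=(0.5, 0.99), eps=1e-8)

def linear_decay(fixed_epochs, decay_epochs):
    """Multiplier on the initial lr: 1.0 for the first `fixed_epochs`,
    then linear decay to zero over the following `decay_epochs`."""
    def fn(epoch):
        if epoch < fixed_epochs:
            return 1.0
        return max(0.0, 1.0 - (epoch - fixed_epochs) / decay_epochs)
    return fn

# G_mou: fixed for 400 epochs, then linear decay to zero over another 400.
sched_g = torch.optim.lr_scheduler.LambdaLR(opt_g, linear_decay(400, 400))
sched_d = torch.optim.lr_scheduler.LambdaLR(opt_d, linear_decay(400, 400))

for epoch in range(800):
    # ... one training epoch over the batch-size-32 loader would go here ...
    sched_g.step()
    sched_d.step()
```

Because LambdaLR scales the initial learning rate by the returned factor, the same helper covers G_upp and G_hed with linear_decay(50, 50); for G_vid, whose learning rate decays from 0.0002 to a nonzero floor of 0.0001, the factor would be clamped at 0.5 rather than 0.0.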