Talking Face Generation by Conditional Recurrent Adversarial Network

Authors: Yang Song, Jingwen Zhu, Dawei Li, Andy Wang, Hairong Qi

IJCAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experimental results demonstrate the superiority of our framework over the state-of-the-art in terms of visual quality, lip sync accuracy, and smooth transition pertaining to both lip and facial movement." Sections 4.1-4.5 describe Datasets, Experimental Setup, Qualitative Evaluation, Quantitative Evaluation, and User Study, all indicative of empirical research.
Researcher Affiliation | Collaboration | University of Tennessee, Knoxville; Samsung Research America; IBM
Pseudocode | No | The paper provides network architecture diagrams and descriptions of its components, but no formal pseudocode or algorithm blocks.
Open Source Code | Yes | "Code and more results: https://github.com/susanqq/Talking_Face_Generation"
Open Datasets | Yes | "We use TCD-TIMIT [Harte and Gillen, 2015], LRW [Chung and Zisserman, 2016], and VoxCeleb [Nagrani et al., 2017] in our experiments."
Dataset Splits | Yes | "For a fair comparison and performance evaluation, we split each dataset into training and testing sets following the same experimental setting as previous works [Chung et al., 2017; Chen et al., 2018; Zhou et al., 2019]."
Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., GPU model, CPU model, memory).
Software Dependencies | No | The paper mentions "ADAM [Kingma and Ba, 2014] as the optimizer" but does not specify any software libraries or frameworks with version numbers (e.g., TensorFlow, PyTorch, or specific Python libraries).
Experiment Setup | Yes | "We select ADAM [Kingma and Ba, 2014] as the optimizer with α = 0.0002 and β = 0.5 in the experiment. We first train our network without discriminators for 30 epochs, and then add D_I, D_V, and D_l to finetune the network for another 15 epochs. The weights for L_I, L_V, and L_l are 1e-3, 1e-2, and 1e-3, respectively. For the input/ground truth images, face regions are cropped from the videos and resized to 128 × 128. For the audio inputs, we try different window sizes for the MFCC feature and find that 350 ms gives the best result."
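
For readers who want to reproduce this setting, below is a minimal PyTorch sketch of the two-phase training schedule quoted above. The Generator stub, the three stand-in discriminator heads, the L1 reconstruction term, and the dummy batch are illustrative assumptions; only the Adam settings (α = 0.0002, β1 = 0.5), the 30 + 15 epoch schedule, the loss weights 1e-3 / 1e-2 / 1e-3, the 128 × 128 face crops, and the 350 ms MFCC window come from the paper's description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder networks: a one-layer "generator" and three tiny heads standing in
# for the paper's image, video, and lip-sync discriminators D_I, D_V, and D_l.
class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, x):
        return self.net(x)

generator = Generator()
d_image, d_video, d_lip = nn.Linear(8, 1), nn.Linear(8, 1), nn.Linear(8, 1)

# Adam with alpha = 0.0002 and beta1 = 0.5 as reported; beta2 = 0.999 is the
# PyTorch default and an assumption here.
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))

# Adversarial loss weights for L_I, L_V, and L_l as reported in the paper.
LAMBDA_I, LAMBDA_V, LAMBDA_L = 1e-3, 1e-2, 1e-3

EPOCHS_RECON_ONLY = 30  # phase 1: train without discriminators
EPOCHS_FINETUNE = 15    # phase 2: add D_I, D_V, D_l and finetune

for epoch in range(EPOCHS_RECON_ONLY + EPOCHS_FINETUNE):
    use_discriminators = epoch >= EPOCHS_RECON_ONLY

    # Dummy batch: a real loader would supply 128x128 face crops paired with
    # MFCC features computed over a 350 ms audio window per frame.
    target = torch.randn(1, 3, 128, 128)
    fake = generator(torch.randn(1, 3, 128, 128))

    loss_g = F.l1_loss(fake, target)  # stand-in reconstruction term
    if use_discriminators:
        feat = torch.randn(1, 8)  # stand-in for discriminator inputs
        loss_g = loss_g + (LAMBDA_I * d_image(feat).mean()
                           + LAMBDA_V * d_video(feat).mean()
                           + LAMBDA_L * d_lip(feat).mean())

    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    # The alternating discriminator update is omitted from this sketch.
```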