Talking Face Generation by Conditional Recurrent Adversarial Network

Authors: Yang Song, Jingwen Zhu, Dawei Li, Andy Wang, Hairong Qi

IJCAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experimental results demonstrate the superiority of our framework over the state-of-the-art in terms of visual quality, lip sync accuracy, and smooth transition pertaining to both lip and facial movement." Sections 4.1-4.5 describe Datasets, Experimental Setup, Qualitative Evaluation, Quantitative Evaluation, and User Study, all indicative of empirical research.
Researcher Affiliation | Collaboration | University of Tennessee, Knoxville; Samsung Research America; IBM
Pseudocode | No | The paper provides network architecture diagrams and descriptions of its components, but no formal pseudocode or algorithm blocks.
Open Source Code | Yes | "Code and more results: https://github.com/susanqq/Talking_Face_Generation"
Open Datasets | Yes | "We use TCD-TIMIT [Harte and Gillen, 2015], LRW [Chung and Zisserman, 2016], and VoxCeleb [Nagrani et al., 2017] in our experiments."
Dataset Splits | Yes | "For a fair comparison and performance evaluation, we split each dataset into training and testing sets following the same experimental setting as previous works [Chung et al., 2017; Chen et al., 2018; Zhou et al., 2019]."
Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., GPU model, CPU model, memory).
Software Dependencies | No | The paper mentions "ADAM [Kingma and Ba, 2014] as the optimizer" but does not specify any software libraries or frameworks with version numbers (e.g., TensorFlow, PyTorch, or specific Python libraries).
Experiment Setup | Yes | "We select ADAM [Kingma and Ba, 2014] as the optimizer with α = 0.0002 and β = 0.5 in the experiment. We first train our network without discriminators for 30 epochs, and then add D_I, D_V, and D_l to finetune the network for another 15 epochs. The weights for L_I, L_V, and L_l are 1e-3, 1e-2, and 1e-3, respectively. For the input/ground truth images, face regions are cropped from the videos and resized to 128 × 128. For the audio inputs, we try different window sizes for the MFCC feature and find that 350 ms gives the best result."
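
For readers who want to reproduce this setting, below is a minimal PyTorch sketch of the two-phase training schedule quoted above. The Generator stub, the three stand-in discriminator heads, the L1 reconstruction term, and the dummy batch are illustrative assumptions; only the Adam settings (α = 0.0002, β1 = 0.5), the 30 + 15 epoch schedule, the loss weights 1e-3 / 1e-2 / 1e-3, the 128 × 128 face crops, and the 350 ms MFCC window come from the paper's description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder networks: a one-layer "generator" and three tiny heads standing in
# for the paper's image, video, and lip-sync discriminators D_I, D_V, and D_l.
class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, x):
        return self.net(x)

generator = Generator()
d_image, d_video, d_lip = nn.Linear(8, 1), nn.Linear(8, 1), nn.Linear(8, 1)

# Adam with alpha = 0.0002 and beta1 = 0.5 as reported; beta2 = 0.999 is the
# PyTorch default and an assumption here.
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))

# Adversarial loss weights for L_I, L_V, and L_l as reported in the paper.
LAMBDA_I, LAMBDA_V, LAMBDA_L = 1e-3, 1e-2, 1e-3

EPOCHS_RECON_ONLY = 30  # phase 1: train without discriminators
EPOCHS_FINETUNE = 15    # phase 2: add D_I, D_V, D_l and finetune

for epoch in range(EPOCHS_RECON_ONLY + EPOCHS_FINETUNE):
    use_discriminators = epoch >= EPOCHS_RECON_ONLY

    # Dummy batch: a real loader would supply 128x128 face crops paired with
    # MFCC features computed over a 350 ms audio window per frame.
    target = torch.randn(1, 3, 128, 128)
    fake = generator(torch.randn(1, 3, 128, 128))

    loss_g = F.l1_loss(fake, target)  # stand-in reconstruction term
    if use_discriminators:
        feat = torch.randn(1, 8)  # stand-in for discriminator inputs
        loss_g = loss_g + (LAMBDA_I * d_image(feat).mean()
                           + LAMBDA_V * d_video(feat).mean()
                           + LAMBDA_L * d_lip(feat).mean())

    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    # The alternating discriminator update is omitted from this sketch.
```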