Talking Face Generation by Conditional Recurrent Adversarial Network
Authors: Yang Song, Jingwen Zhu, Dawei Li, Andy Wang, Hairong Qi
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results demonstrate the superiority of our framework over the state-of-the-art in terms of visual quality, lip sync accuracy, and smooth transition pertaining to both lip and facial movement. Sections 4.1-4.5 describe Datasets, Experimental Setup, Qualitative Evaluation, Quantitative Evaluation, and User Study, all indicative of empirical research. |
| Researcher Affiliation | Collaboration | ¹University of Tennessee, Knoxville; ²Samsung Research America; ³IBM |
| Pseudocode | No | The paper provides network architecture diagrams and descriptions of its components, but no formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and more results: https://github.com/susanqq/Talking_Face_Generation |
| Open Datasets | Yes | We use TCD-TIMIT [Harte and Gillen, 2015], LRW [Chung and Zisserman, 2016], and VoxCeleb [Nagrani et al., 2017] in our experiments. |
| Dataset Splits | Yes | For a fair comparison and performance evaluation, we split each dataset into training and testing sets following the same experimental setting as previous works [Chung et al., 2017; Chen et al., 2018; Zhou et al., 2019]. |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., GPU model, CPU model, memory). |
| Software Dependencies | No | The paper mentions 'ADAM [Kingma and Ba, 2014] as the optimizer' but does not specify any software libraries or frameworks with version numbers (e.g., TensorFlow, PyTorch, or specific Python libraries). |
| Experiment Setup | Yes | We select ADAM [Kingma and Ba, 2014] as the optimizer with α = 0.0002 and β = 0.5 in the experiment. We first train our network without discriminators for 30 epochs, and then add D_I, D_V, D_l to finetune the network for another 15 epochs. The weights for L_I, L_V, and L_l are 1e-3, 1e-2, and 1e-3, respectively. For the input/ground truth images, face regions are cropped from the videos and resized to 128 × 128. For the audio inputs, we try different window sizes for MFCC feature and find that 350ms gives the best result. (A hedged training sketch based on these values follows the table.) |
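
Since the setup row quotes concrete hyper-parameters but the paper's text contains no training script, the minimal PyTorch sketch below shows how those values could plug into a two-phase GAN schedule. Only the quoted numbers come from the paper (lr 2e-4, β₁ 0.5, 30 + 15 epochs, loss weights 1e-3/1e-2/1e-3, 128 × 128 crops, 350 ms MFCC window); everything else is an assumption: the placeholder module bodies, `AUDIO_DIM`, the L1 reconstruction term, the BCE adversarial form, β₂ = 0.999, and the random stand-in batches. This is not the authors' implementation.

```python
import torch
import torch.nn as nn

# Quoted from the paper:
IMG = 128                          # face crops resized to 128 x 128
MFCC_WIN_MS = 350                  # MFCC window size the authors found best
W_I, W_V, W_L = 1e-3, 1e-2, 1e-3   # weights for L_I, L_V, L_l

# Assumption: the real conditional recurrent generator and the three
# discriminators D_I (image), D_V (video), D_l (lip) are not given at
# code level, so single linear layers stand in for them here.
AUDIO_DIM = 256                    # hypothetical audio-feature size
G   = nn.Sequential(nn.Linear(AUDIO_DIM, 3 * IMG * IMG))
D_I = nn.Sequential(nn.Linear(3 * IMG * IMG, 1))
D_V = nn.Sequential(nn.Linear(3 * IMG * IMG, 1))
D_l = nn.Sequential(nn.Linear(3 * IMG * IMG, 1))

# ADAM with alpha = 2e-4 and beta_1 = 0.5 as stated; beta_2 is assumed.
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(
    [*D_I.parameters(), *D_V.parameters(), *D_l.parameters()],
    lr=2e-4, betas=(0.5, 0.999))

recon = nn.L1Loss()                # assumed reconstruction term
adv = nn.BCEWithLogitsLoss()       # assumed adversarial form

def generator_loss(audio, real, use_discriminators):
    """Reconstruction loss, plus weighted adversarial terms in phase 2."""
    fake = G(audio)
    loss = recon(fake, real)
    if use_discriminators:
        for D, w in ((D_I, W_I), (D_V, W_V), (D_l, W_L)):
            pred = D(fake)
            loss = loss + w * adv(pred, torch.ones_like(pred))
    return loss

# Phase 1: 30 epochs without discriminators; phase 2: 15 more with
# D_I, D_V, D_l added. Discriminator updates (opt_d) omitted for brevity;
# random tensors stand in for a real data loader.
for epoch in range(45):
    audio = torch.randn(4, AUDIO_DIM)
    real = torch.randn(4, 3 * IMG * IMG)
    opt_g.zero_grad()
    loss = generator_loss(audio, real, use_discriminators=epoch >= 30)
    loss.backward()
    opt_g.step()
```

The two-phase split mirrors the quoted schedule: the generator first learns a stable audio-to-face mapping from reconstruction alone, and the three adversarial terms are only blended in (at the paper's stated weights) during finetuning.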