SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory
Authors: Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro
AAAI 2022, pp. 2062-2070 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are performed to verify that our method generates high-quality video with mouth shapes that best align with the input audio, outperforming previous state-of-the-art methods. |
| Researcher Affiliation | Academia | Image and Video Systems Lab, KAIST, South Korea {jinny960812, ms.k, joanna2587, jeongsoo.choi, ymro}@kaist.ac.kr |
| Pseudocode | No | The paper describes methodological steps in paragraph text and mathematical equations but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a direct link to a code repository. |
| Open Datasets | Yes | We train and evaluate on LRW (Chung and Zisserman 2016a) and LRS2 (Afouras et al. 2018) datasets. LRW is a word-level dataset with over 1000 utterances of 500 words. LRS2 is a sentence-level dataset with over 140,000 utterances. Both are from BBC News in the wild. |
| Dataset Splits | No | The paper mentions training and evaluation on LRW and LRS2 datasets, but does not provide specific training/validation/test split percentages, sample counts, or references to predefined full splits. |
| Hardware Specification | Yes | We train on 8 RTX 3090 GPUs and Intel Xeon Gold CPU. |
| Software Dependencies | No | The paper mentions 'PyTorch' as the framework and 'dlib' for landmark detection, but does not specify version numbers for these software components or other dependencies. |
| Experiment Setup | Yes | Hyper-parameters are empirically set: λ₁ to 10, λ₂, λ₃, λ₄, λ₅, λ₆ all to 0.01, and κ to 16. We take Wav2Lip as a baseline model and add the Audio-Lip Memory and a lip encoder, which consists of a 3D convolutional layer followed by 2D convolutional layers to encode lip motion features. We empirically find the optimum slot size to be 96. We first pre-train SyncNet on the target dataset and then train the framework with the total loss L using the Adam optimizer in PyTorch. The learning rate is set to 1 × 10⁻⁴, except for the discriminator, whose learning rate is 5 × 10⁻⁴. (Hedged sketches of this setup follow the table.) |
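The lip encoder described in the Experiment Setup row (a 3D convolutional layer followed by 2D convolutional layers) could look roughly like the PyTorch sketch below. The channel widths, kernel sizes, strides, and output dimension are illustrative assumptions; the paper's table entry does not specify them.

```python
import torch
import torch.nn as nn

class LipEncoder(nn.Module):
    """Sketch of a lip-motion encoder: one 3D conv over time, then a
    stack of 2D convs per frame. All layer sizes are assumptions."""

    def __init__(self, out_dim: int = 512):
        super().__init__()
        # The 3D conv captures short-range lip motion across frames.
        self.conv3d = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(5, 7, 7),
                      stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
        )
        # The 2D convs then encode each motion-aware frame independently.
        self.conv2d = nn.Sequential(
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, out_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 3, T, H, W) clip of the cropped lip region.
        b = x.size(0)
        x = self.conv3d(x)                   # (B, 64, T, H/2, W/2)
        t = x.size(2)
        x = x.transpose(1, 2).flatten(0, 1)  # (B*T, 64, H/2, W/2)
        x = self.conv2d(x).flatten(1)        # (B*T, out_dim)
        return x.view(b, t, -1)              # per-frame lip features
```

For example, `LipEncoder()(torch.randn(2, 3, 5, 96, 96))` yields a `(2, 5, 512)` tensor of per-frame lip features under these assumed sizes.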
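The reported optimization settings (Adam; learning rate 1 × 10⁻⁴ for all modules except the discriminator at 5 × 10⁻⁴; loss weights λ₁ = 10 and λ₂ through λ₆ all 0.01) could be wired up as below. The loss-term names and the module arguments are placeholders, since the table entry does not enumerate the six individual losses.

```python
import torch
import torch.nn as nn

# Loss weights as reported: λ1 = 10, λ2..λ6 = 0.01.
# The term names here are hypothetical labels, not the paper's notation.
LAMBDAS = {"lambda1": 10.0, "lambda2": 0.01, "lambda3": 0.01,
           "lambda4": 0.01, "lambda5": 0.01, "lambda6": 0.01}

def build_optimizers(generator: nn.Module, discriminator: nn.Module):
    """Adam with lr 1e-4 everywhere except the discriminator (5e-4)."""
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=5e-4)
    return opt_g, opt_d

def total_loss(losses: dict) -> torch.Tensor:
    """Weighted total loss L = sum_i lambda_i * L_i over the six terms."""
    return sum(LAMBDAS[name] * value for name, value in losses.items())
```

In use, `total_loss` would be called with a dict of scalar loss tensors keyed by the same names as `LAMBDAS`, and the two optimizers would step the generator (Wav2Lip baseline plus Audio-Lip Memory and lip encoder) and the discriminator separately.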