SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory

Authors: Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are performed to verify that our method generates high-quality video with mouth shapes that best align with the input audio, outperforming previous state-of-the-art methods.
Researcher Affiliation | Academia | Image and Video Systems Lab, KAIST, South Korea {jinny960812, ms.k, joanna2587, jeongsoo.choi, ymro}@kaist.ac.kr
Pseudocode | No | The paper describes methodological steps in paragraph text and mathematical equations but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a direct link to a code repository.
Open Datasets | Yes | We train and evaluate on LRW (Chung and Zisserman 2016a) and LRS2 (Afouras et al. 2018) datasets. LRW is a word-level dataset with over 1000 utterances of 500 words. LRS2 is a sentence-level dataset with over 140,000 utterances. Both are from BBC News in the wild.
Dataset Splits | No | The paper mentions training and evaluation on the LRW and LRS2 datasets but does not provide specific training/validation/test split percentages, sample counts, or references to predefined splits.
Hardware Specification | Yes | We train on 8 RTX 3090 GPUs and an Intel Xeon Gold CPU.
Software Dependencies | No | The paper names PyTorch as the framework and dlib for landmark detection but does not specify version numbers for these or any other software components.
Experiment Setup | Yes | Hyper-parameters are empirically set: λ1 to 10, λ2, λ3, λ4, λ5, λ6 all to 0.01, and κ to 16. We take Wav2Lip as a baseline model and add the Audio-Lip Memory and a lip encoder, which consists of a 3D convolutional layer followed by 2D convolutional layers to encode the lip motion feature. We empirically find the optimum slot size to be 96. We first pre-train SyncNet on the target dataset and then train the framework with the total loss L using the Adam optimizer in PyTorch. The learning rate is set to 1 × 10^-4, except for the discriminator, whose learning rate is 5 × 10^-4.
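The Experiment Setup row above reports concrete hyper-parameters but no code. Below is a minimal PyTorch sketch of that configuration: the loss weights (λ1 = 10, λ2-λ6 = 0.01, κ = 16), the memory slot size of 96, the lip encoder layout (one 3D convolution followed by 2D convolutions), and the Adam learning rates (1e-4, 5e-4 for the discriminator) come from the quoted text; the channel and kernel sizes and the discriminator stub are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class LipEncoder(nn.Module):
    """Lip-motion encoder: one 3D convolution over the frame sequence,
    followed by 2D convolutions per frame (channel/kernel sizes are assumed)."""

    def __init__(self):
        super().__init__()
        self.conv3d = nn.Conv3d(3, 32, kernel_size=(5, 3, 3), padding=(2, 1, 1))
        self.conv2d = nn.Sequential(
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        x = torch.relu(self.conv3d(x))
        b, c, t, h, w = x.shape
        # Fold the time axis into the batch so the 2D convolutions run per frame.
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        return self.conv2d(x)


# Loss weights and constants as reported: lambda_1 = 10, lambda_2..lambda_6 = 0.01, kappa = 16.
LAMBDA = [10.0, 0.01, 0.01, 0.01, 0.01, 0.01]
KAPPA = 16
MEMORY_SLOTS = 96  # optimum Audio-Lip Memory slot size reported in the paper

lip_encoder = LipEncoder()
# Placeholder discriminator; the paper builds on Wav2Lip's discriminator, not this stub.
discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2), nn.Conv2d(64, 1, 4)
)

# Adam with learning rate 1e-4 for generator-side modules and 5e-4 for the discriminator.
opt_g = torch.optim.Adam(lip_encoder.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=5e-4)


def total_loss(loss_terms):
    """Weighted total loss L = sum_i lambda_i * L_i over the six reported loss terms."""
    return sum(w * l for w, l in zip(LAMBDA, loss_terms))
```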
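The Software Dependencies row notes that dlib is used for landmark detection without further detail. The sketch below shows one common way mouth landmarks are extracted with dlib; the 68-point predictor file and the mouth index range 48-67 are standard dlib conventions assumed here, not details taken from the paper.

```python
import dlib
import numpy as np

# Standard dlib face detector and 68-point landmark predictor (model file assumed to be on disk).
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")


def mouth_landmarks(image: np.ndarray) -> np.ndarray:
    """Return the 20 mouth landmarks (points 48-67) of the first detected face,
    or an empty array if no face is found. `image` is an RGB uint8 array."""
    faces = detector(image, 1)
    if not faces:
        return np.empty((0, 2), dtype=np.int64)
    shape = predictor(image, faces[0])
    return np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
```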