Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style

Authors: Shuai Tan, Bin Ji, Ye Pan

AAAI 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results demonstrate our method outperforms existing state-of-the-art methods in terms of audio-lip synchronization and performance of both emotion style and art style. |
| Researcher Affiliation | Academia | Shuai Tan, Bin Ji, Ye Pan*, Shanghai Jiao Tong University; {tanshuai0219, bin.ji, whitneypanye}@sjtu.edu.cn |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | For the Style-E stage, we leverage MEAD dataset (Wang et al. 2020) with the synthetically generated textual descriptions for emotion styles. ... For the Style-A stage, we additionally utilize another audio-visual dataset HDTF (Zhang et al. 2021), which consists of talking videos from more than 300 speakers. To obtain the art style reference, we use various art datasets (Huo et al. 2017a,b). |
| Dataset Splits | No | The paper does not explicitly provide specific training/validation/test dataset splits (e.g., percentages or exact counts). |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper mentions several tools and models (e.g., GPT-3, CLIP, OpenFace, StyleGAN) but does not provide specific version numbers for software dependencies or programming languages (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | To optimize the inference time, we employ a simpler and more efficient diffusion model as our motion generator. Furthermore, we leverage the DDIM (Song, Meng, and Ermon 2020) technique, which allows us to sample only 5 steps instead of 1000 during inference, contributing to a substantial decrease in inference time. ... we import reconstruction loss L_rec and perceptual loss L_prec (Johnson, Alahi, and Fei-Fei 2016) to constrain the networks. ... λ = 0.1 refers to the weight of L_prec. ... During Style-A training, we freeze the weights of E_s and G, which are pretrained in DualStyleGAN, and optimize the remaining networks (i.e., E_c, G_m, and R). (Minimal code sketches of the 5-step DDIM sampling and of this training setup follow the table.) |
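
The Experiment Setup row quotes a motion generator trained as a 1000-step diffusion model but sampled with only 5 DDIM steps at inference. The sketch below illustrates deterministic DDIM (η = 0) sampling over a 5-step subsequence of a 1000-step schedule; the linear beta schedule, the dummy denoiser, and the absence of audio/style conditioning are assumptions made for illustration only and are not taken from the paper.

```python
# Illustrative 5-step DDIM sampling (eta = 0) over a 1000-step training schedule.
# The denoiser, beta schedule, and tensor shapes are placeholders, not the paper's.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # assumed linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, 0)  # cumulative alpha_bar_t

class DummyDenoiser(nn.Module):
    """Stand-in for the paper's diffusion-based motion generator."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, x, t):
        return self.net(x)  # a real model would condition on t, audio, and style

@torch.no_grad()
def ddim_sample(model, x, num_steps=5):
    """Deterministic DDIM: visit only `num_steps` of the T training timesteps."""
    timesteps = torch.linspace(T - 1, 0, num_steps).long()
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        eps = model(x, t)                                   # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean sample
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # DDIM update, no added noise
    return x

motion = ddim_sample(DummyDenoiser(), torch.randn(1, 64))
print(motion.shape)  # torch.Size([1, 64])
```

With 5 steps instead of 1000, the denoiser is evaluated 200x fewer times per sample, which is where the quoted reduction in inference time comes from.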
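
The same row quotes a reconstruction loss plus a perceptual loss weighted by λ = 0.1, with the DualStyleGAN-pretrained E_s and G frozen while E_c, G_m, and R are optimized. The sketch below only illustrates that freeze/optimize split and the loss weighting; the tiny placeholder modules, the way they are composed, the L1 reconstruction term, the pooled-feature stand-in for the VGG perceptual loss, and the Adam settings are all assumptions, not the paper's implementation.

```python
# Sketch of the quoted Style-A training step: freeze the DualStyleGAN-pretrained
# modules (E_s, G), train the rest (E_c, G_m, R), and minimize L_rec + 0.1 * L_prec.
# All architectures and the dataflow between modules are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNet(nn.Module):
    """Placeholder network; the paper's encoders/generators are far larger."""
    def __init__(self, ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, ch, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

E_s, G = TinyNet(), TinyNet()                   # pretrained in DualStyleGAN -> frozen
E_c, G_m, R = TinyNet(), TinyNet(), TinyNet()   # optimized during Style-A training

for frozen in (E_s, G):
    frozen.eval()
    for p in frozen.parameters():
        p.requires_grad_(False)

params = list(E_c.parameters()) + list(G_m.parameters()) + list(R.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)   # optimizer and LR are assumptions

def perceptual_loss(pred, target):
    """Stand-in for the VGG-feature perceptual loss of Johnson et al. (2016)."""
    return F.l1_loss(F.avg_pool2d(pred, 4), F.avg_pool2d(target, 4))

lam = 0.1  # weight of the perceptual term, as quoted in the row above

# One illustrative step on random tensors standing in for frames; the real pipeline
# is conditioned on audio and style references, which are omitted here.
source, target = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
style = E_s(target)                     # frozen style path, no parameter updates
pred = R(G(G_m(E_c(source)) + style))   # frozen G, trainable E_c / G_m / R

loss = F.l1_loss(pred, target) + lam * perceptual_loss(pred, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"loss = {loss.item():.4f}")
```

Gradients still flow through the frozen E_s and G, but only the parameters of E_c, G_m, and R receive updates, matching the quoted description of Style-A training.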