Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style

Authors: Shuai Tan, Bin Ji, Ye Pan

AAAI 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results demonstrate our method outperforms existing state-of-the-art methods in terms of audio-lip synchronization and performance of both emotion style and art style. |
| Researcher Affiliation | Academia | Shuai Tan, Bin Ji, Ye Pan*, Shanghai Jiao Tong University; {tanshuai0219, bin.ji, whitneypanye}@sjtu.edu.cn |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | For the Style-E stage, we leverage MEAD dataset (Wang et al. 2020) with the synthetically generated textual descriptions for emotion styles. ... For the Style-A stage, we additionally utilize another audio-visual dataset HDTF (Zhang et al. 2021), which consists of talking videos from more than 300 speakers. To obtain the art style reference, we use various art datasets (Huo et al. 2017a,b). |
| Dataset Splits | No | The paper does not explicitly provide specific training/validation/test dataset splits (e.g., percentages or exact counts). |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper mentions several tools and models (e.g., GPT-3, CLIP, OpenFace, StyleGAN) but does not provide specific version numbers for software dependencies or programming languages (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | To optimize the inference time, we employ a simpler and more efficient diffusion model as our motion generator. Furthermore, we leverage the DDIM (Song, Meng, and Ermon 2020) technique, which allows us to sample only 5 steps instead of 1000 during inference, contributing to a substantial decrease in inference time. ... we import reconstruction loss L_rec and perceptual loss L_prec (Johnson, Alahi, and Fei-Fei 2016) to constrain the networks. ... λ = 0.1 refers to the weight of L_prec. ... During Style-A training, we freeze the weights of E_s and G, which are pretrained in DualStyleGAN, and optimize the remaining networks (i.e., E_c, G_m, and R). (Minimal code sketches of the 5-step DDIM sampling and of this training setup follow the table.) |
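
The Experiment Setup row quotes a motion generator trained as a 1000-step diffusion model but sampled with only 5 DDIM steps at inference. The sketch below illustrates deterministic DDIM (η = 0) sampling over a 5-step subsequence of a 1000-step schedule; the linear beta schedule, the dummy denoiser, and the absence of audio/style conditioning are assumptions made for illustration only and are not taken from the paper.

```python
# Illustrative 5-step DDIM sampling (eta = 0) over a 1000-step training schedule.
# The denoiser, beta schedule, and tensor shapes are placeholders, not the paper's.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # assumed linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, 0)  # cumulative alpha_bar_t

class DummyDenoiser(nn.Module):
    """Stand-in for the paper's diffusion-based motion generator."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, x, t):
        return self.net(x)  # a real model would condition on t, audio, and style

@torch.no_grad()
def ddim_sample(model, x, num_steps=5):
    """Deterministic DDIM: visit only `num_steps` of the T training timesteps."""
    timesteps = torch.linspace(T - 1, 0, num_steps).long()
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        eps = model(x, t)                                   # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean sample
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # DDIM update, no added noise
    return x

motion = ddim_sample(DummyDenoiser(), torch.randn(1, 64))
print(motion.shape)  # torch.Size([1, 64])
```

With 5 steps instead of 1000, the denoiser is evaluated 200x fewer times per sample, which is where the quoted reduction in inference time comes from.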
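
The same row quotes a reconstruction loss plus a perceptual loss weighted by λ = 0.1, with the DualStyleGAN-pretrained E_s and G frozen while E_c, G_m, and R are optimized. The sketch below only illustrates that freeze/optimize split and the loss weighting; the tiny placeholder modules, the way they are composed, the L1 reconstruction term, the pooled-feature stand-in for the VGG perceptual loss, and the Adam settings are all assumptions, not the paper's implementation.

```python
# Sketch of the quoted Style-A training step: freeze the DualStyleGAN-pretrained
# modules (E_s, G), train the rest (E_c, G_m, R), and minimize L_rec + 0.1 * L_prec.
# All architectures and the dataflow between modules are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNet(nn.Module):
    """Placeholder network; the paper's encoders/generators are far larger."""
    def __init__(self, ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, ch, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

E_s, G = TinyNet(), TinyNet()                   # pretrained in DualStyleGAN -> frozen
E_c, G_m, R = TinyNet(), TinyNet(), TinyNet()   # optimized during Style-A training

for frozen in (E_s, G):
    frozen.eval()
    for p in frozen.parameters():
        p.requires_grad_(False)

params = list(E_c.parameters()) + list(G_m.parameters()) + list(R.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)   # optimizer and LR are assumptions

def perceptual_loss(pred, target):
    """Stand-in for the VGG-feature perceptual loss of Johnson et al. (2016)."""
    return F.l1_loss(F.avg_pool2d(pred, 4), F.avg_pool2d(target, 4))

lam = 0.1  # weight of the perceptual term, as quoted in the row above

# One illustrative step on random tensors standing in for frames; the real pipeline
# is conditioned on audio and style references, which are omitted here.
source, target = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
style = E_s(target)                     # frozen style path, no parameter updates
pred = R(G(G_m(E_c(source)) + style))   # frozen G, trainable E_c / G_m / R

loss = F.l1_loss(pred, target) + lam * perceptual_loss(pred, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"loss = {loss.item():.4f}")
```

Gradients still flow through the frozen E_s and G, but only the parameters of E_c, G_m, and R receive updates, matching the quoted description of Style-A training.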