StyleTalk: One-Shot Talking Head Generation with Controllable Speaking Styles

Authors: Yifeng Ma, Suzhen Wang, Zhipeng Hu, Changjie Fan, Tangjie Lv, Yu Ding, Zhidong Deng, Xin Yu

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects.
Researcher Affiliation | Collaboration | Yifeng Ma1*, Suzhen Wang2, Zhipeng Hu2,3, Changjie Fan2, Tangjie Lv2, Yu Ding2,3, Zhidong Deng1, Xin Yu4. 1 Department of Computer Science and Technology, BNRist, THUAI, State Key Laboratory of Intelligent Technology and Systems, Tsinghua University; 2 Virtual Human Group, Netease Fuxi AI Lab; 3 Zhejiang University; 4 University of Technology Sydney
Pseudocode | No | The paper does not contain explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Project Page: https://github.com/FuxiVirtualHuman/styletalk
Open Datasets | Yes | We construct our dataset based on the widely used datasets, MEAD (Wang et al. 2020) and HDTF (Zhang et al. 2021c).
Dataset Splits | No | The paper mentions training and testing phases and the MEAD and HDTF datasets, but does not provide explicit numerical splits (e.g., percentages or exact counts) for training, validation, and test sets. It states: 'We conduct the experiments in the self-driven setting on the test set, where the speaker and the speaking style are not seen during training.'
Hardware Specification | Yes | Dsync and Dstyle are trained on HDTF and MEAD for 12 hours on 4 RTX 3090 GPUs with a learning rate of 0.0001. Ea, Es, Ed and Dtem are trained jointly on HDTF and MEAD for 4 hours on 2 RTX 3090 GPUs with a learning rate of 0.0001.
Software Dependencies | No | Our framework is implemented by PyTorch. We employ the Adam optimizer (Kingma and Ba 2014) for training. The paper mentions PyTorch and the Adam optimizer but does not specify version numbers or any other software dependencies with specific versions.
Experiment Setup | Yes | We adopt the training strategy proposed in Wang et al. (2022) by taking the assembled input $\{I_r, a_{t-w,t+w}, V\}$ in a sliding window, where $w$ is the window length and is set to 5. We employ the Adam optimizer (Kingma and Ba 2014) for training. Dsync and Dstyle are trained on HDTF and MEAD for 12 hours on 4 RTX 3090 GPUs with a learning rate of 0.0001. Er, Dsync and Dstyle are then frozen. Ea, Es, Ed and Dtem are trained jointly on HDTF and MEAD for 4 hours on 2 RTX 3090 GPUs with a learning rate of 0.0001. We generate successive $L = 64$ frames $\delta_{1:L}$ at one time as a clip. The reconstruction loss is $\mathcal{L}_{rec} = \mu \mathcal{L}_{L1}(\delta_{1:L}, \hat{\delta}_{1:L}) + (1 - \mu)\,\mathcal{L}_{ssim}(\delta_{1:L}, \hat{\delta}_{1:L})$ (8), where $\delta_{1:L}$ and $\hat{\delta}_{1:L}$ are the ground-truth and reconstructed facial expressions, respectively, and $\mu$ is a ratio coefficient set to 0.1. The total loss combines the aforementioned loss terms: $\mathcal{L} = \lambda_{rec}\mathcal{L}_{rec} + \lambda_{trip}\mathcal{L}_{trip} + \lambda_{sync}\mathcal{L}_{sync} + \lambda_{tem}\mathcal{L}_{tem} + \lambda_{style}\mathcal{L}_{style}$ (9), with $\lambda_{rec} = 88$, $\lambda_{trip} = 1$, $\lambda_{sync} = 1$, $\lambda_{tem} = 1$ and $\lambda_{style} = 1$. The margin parameter $\gamma$ is set to 5.
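
For readers reproducing the setup above, here is a minimal PyTorch sketch of the loss weighting in Eqs. (8) and (9) together with the reported optimizer settings. It is not the authors' implementation: the function names (`reconstruction_loss`, `total_loss`), the `ssim_loss` callable, and `generator_modules` are illustrative assumptions; only the numeric values (mu = 0.1, the lambda weights, L = 64, w = 5, and the 1e-4 learning rate) come from the paper.

```python
# Minimal sketch, not the authors' released code: it wires together the loss
# weighting of Eqs. (8)-(9) and the reported Adam learning rate, assuming PyTorch.
# `reconstruction_loss`, `total_loss`, `ssim_loss`, and `generator_modules`
# are illustrative assumptions, not identifiers from the paper.
import torch
import torch.nn.functional as F

MU = 0.1          # ratio coefficient mu in Eq. (8)
CLIP_LEN = 64     # L: number of expression frames generated per clip
WINDOW = 5        # w: length parameter of the sliding audio window
LAMBDAS = {"rec": 88.0, "trip": 1.0, "sync": 1.0, "tem": 1.0, "style": 1.0}  # Eq. (9)

def reconstruction_loss(delta_gt, delta_hat, ssim_loss):
    """Eq. (8): mu * L_L1 + (1 - mu) * L_ssim over a clip of L = 64 frames.
    `ssim_loss` is a caller-supplied SSIM-based loss; the paper does not
    name a specific implementation, so it is left abstract here."""
    l1 = F.l1_loss(delta_hat, delta_gt)
    return MU * l1 + (1.0 - MU) * ssim_loss(delta_hat, delta_gt)

def total_loss(terms):
    """Eq. (9): weighted sum of the loss terms, keyed by
    'rec', 'trip', 'sync', 'tem', 'style'."""
    return sum(LAMBDAS[name] * value for name, value in terms.items())

# Both training stages use Adam with a learning rate of 1e-4;
# `generator_modules` stands in for the jointly trained Ea, Es, Ed and Dtem.
# optimizer = torch.optim.Adam(generator_modules.parameters(), lr=1e-4)
```

With the stated weights, the reconstruction term dominates the total objective ($\lambda_{rec} = 88$), while the triplet, sync, temporal and style terms enter with unit weight.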