Say Anything with Any Style

Authors: Shuai Tan, Bin Ji, Yu Ding, Ye Pan

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that our approach surpasses state-of-the-art methods in terms of both lip synchronization and stylized expression.
Researcher Affiliation | Collaboration | Shuai Tan (1), Bin Ji (1), Yu Ding (2), Ye Pan (1, *); (1) Shanghai Jiao Tong University; (2) Virtual Human Group, NetEase Fuxi AI Lab
Pseudocode | No | The paper describes methods and processes but does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include an explicit statement about releasing its source code or a link to a code repository for the described methodology.
Open Datasets | Yes | Two public datasets are leveraged to train and test our proposed SAAS: MEAD (Wang et al. 2020) and HDTF (Zhang et al. 2021b).
Dataset Splits | No | The paper mentions training and testing but does not provide specific details on validation splits (percentages, sample counts, or explicit methodology).
Hardware Specification | Yes | Model training and testing are conducted on 4 NVIDIA GeForce RTX 3090 GPUs with 24 GB memory.
Software Dependencies | No | The paper states 'We implement our SAAS model with Pytorch' and mentions 'Incorporating the Adaptive moment estimation (Adam) optimizer', but it does not specify version numbers for any software dependencies.
Experiment Setup | Yes | We set w = 8, T = 32, N = 500 and ds = 256. Model training and testing are conducted on 4 NVIDIA GeForce RTX 3090 GPUs with 24 GB memory. Incorporating the Adaptive moment estimation (Adam) optimizer (Kingma and Ba 2014), the style codebook Cs and Style Encoder Es are pre-trained for 24 hours. Then, we freeze the weights of Cs and Es and jointly train the whole network with a learning rate of 2e-4 for 500 and 300 epochs in the audio-driven and video-driven settings, respectively.
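The Experiment Setup row is the only place the training procedure is summarized: pre-train the style codebook Cs and Style Encoder Es, then freeze them and jointly train the rest of the network with Adam at a learning rate of 2e-4. Below is a minimal PyTorch sketch of that two-stage schedule, using the hyperparameters quoted above. The module choices (a GRU standing in for Es, an embedding table for Cs, a placeholder generator) and the empty loop bodies are assumptions for illustration only; the authors have not released code, and this is not their implementation.

```python
# Hedged sketch of the two-stage schedule described in the Experiment Setup row.
# All module definitions below are hypothetical stand-ins, not the authors' code.
from torch import nn, optim

# Hyperparameters quoted from the paper.
WINDOW, T, N_CODES, D_S = 8, 32, 500, 256   # w, T, N, ds
LR = 2e-4
EPOCHS_AUDIO_DRIVEN, EPOCHS_VIDEO_DRIVEN = 500, 300

# Hypothetical stand-ins for the paper's components.
style_encoder = nn.GRU(input_size=D_S, hidden_size=D_S, batch_first=True)  # "Es"
style_codebook = nn.Embedding(N_CODES, D_S)                                 # "Cs"
generator = nn.Sequential(nn.Linear(D_S, D_S), nn.ReLU(), nn.Linear(D_S, D_S))

# Stage 1: pre-train the style codebook and style encoder (paper: ~24 hours).
stage1_params = list(style_encoder.parameters()) + list(style_codebook.parameters())
stage1_opt = optim.Adam(stage1_params, lr=LR)
# ... pre-training loop over style clips would go here ...

# Stage 2: freeze Cs and Es, then jointly train the remaining network with Adam.
for p in stage1_params:
    p.requires_grad_(False)
stage2_opt = optim.Adam(generator.parameters(), lr=LR)

for epoch in range(EPOCHS_AUDIO_DRIVEN):    # 300 epochs in the video-driven setting
    # ... forward pass, loss computation, and stage2_opt.step() would go here ...
    pass
```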