Audio-Driven Co-Speech Gesture Video Generation

Authors: Xian Liu, Qianyi Wu, Hang Zhou, Yuanqi Du, Wayne Wu, Dahua Lin, Ziwei Liu

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture video. Demo video and more resources can be found in: https://alvinliu0.github.io/projects/ANGIE
Researcher Affiliation | Collaboration | Xian Liu (1), Qianyi Wu (2), Hang Zhou (1), Yuanqi Du (3), Wayne Wu (4), Dahua Lin (1,4), Ziwei Liu (5). Affiliations: (1) Multimedia Laboratory, The Chinese University of Hong Kong; (2) Monash University; (3) Cornell University; (4) Shanghai AI Laboratory; (5) S-Lab, Nanyang Technological University
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. It provides figures and mathematical equations but no algorithm listings.
Open Source Code | No | Besides, although the code and data are not included, as promised we will make the code, models and data publicly available.
Open Datasets | Yes | Pose, Audio, Transcript, Style (PATS) is a large-scale dataset of 25 speakers with aligned pose, audio and transcripts [1, 2, 21].
Dataset Splits | Yes | We randomly split the segments into 90% for training and 10% for evaluation. (A minimal split sketch is given after this table.)
Hardware Specification | Yes | The overall framework is implemented in PyTorch [37] and trained on one 16G Tesla V100 GPU for three days.
Software Dependencies | No | The paper mentions software like PyTorch, Librosa, and OpenPose, but does not specify their version numbers (e.g., "implemented in PyTorch [37]", "audio MFCC features a_mfcc ∈ R^{28×12} are extracted by Librosa", "2D skeletons of the image frames are obtained by OpenPose [9]").
Experiment Setup | Yes | We sample T = 96 frame clips with stride 32 for training. ... the co-speech gesture pattern codebook size M for both relative shift-translation µ and factorial covariance change L are set to 512. ... The channel dimension ℓ of each codebook entry d^µ, d^L as well as the encoded latent features e^µ, e^L are 512, while the latent temporal dimension T′ is set as T/8 = 12 ... The ϵ is set as 1 × 10^-5 ... The commit loss trade-offs in L_VQ are empirically set as β1 = β2 = 0.1. We optimize the gesture pattern VQ-VAE with Adam optimizer [28] of learning rate 3 × 10^-5. ... the Transformer channel dimension is 768, and the attention layer is implemented in 12 heads with dropout probability of 0.1. (A configuration sketch collecting these values is given after this table.)
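
The split quoted in the Dataset Splits row is a plain random 90/10 partition of PATS segments. Below is a minimal sketch of such a segment-level split, assuming integer segment IDs, a fixed seed, and a helper name chosen for illustration; it is not taken from the authors' code.

```python
import random

def split_segments(segment_ids, train_ratio=0.9, seed=0):
    """Randomly split segment IDs into train/eval subsets (90%/10% by default)."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    ids = list(segment_ids)
    rng.shuffle(ids)                   # in-place random permutation
    cut = int(len(ids) * train_ratio)  # boundary between train and eval
    return ids[:cut], ids[cut:]

# Illustrative usage with 1,000 dummy segment IDs: yields 900 train / 100 eval.
train_ids, eval_ids = split_segments(range(1000))
print(len(train_ids), len(eval_ids))
```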
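For readability, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object, as in the hedged sketch below. The class and field names are assumptions made for illustration; only the numeric values come from the paper's quoted text.

```python
from dataclasses import dataclass

@dataclass
class GestureTrainConfig:
    # clip sampling
    clip_len: int = 96           # T: frames per training clip
    clip_stride: int = 32        # sliding-window stride between clips
    # gesture pattern VQ-VAE (shared by the µ and L codebooks)
    codebook_size: int = 512     # M entries per codebook
    codebook_dim: int = 512      # channel dim ℓ of entries d^µ, d^L and latents e^µ, e^L
    latent_len: int = 12         # T' = T / 8
    eps: float = 1e-5            # ϵ value reported in the paper
    beta1: float = 0.1           # commit-loss trade-off for the µ codebook
    beta2: float = 0.1           # commit-loss trade-off for the L codebook
    vqvae_lr: float = 3e-5       # Adam learning rate for the VQ-VAE
    # Transformer predictor
    transformer_dim: int = 768
    num_heads: int = 12
    dropout: float = 0.1

cfg = GestureTrainConfig()
assert cfg.latent_len == cfg.clip_len // 8  # consistency check: T' = T/8 = 12
# The reported optimizer setting would correspond to something like
#   torch.optim.Adam(vqvae.parameters(), lr=cfg.vqvae_lr)
# where `vqvae` is the gesture pattern VQ-VAE model (not defined here).
```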