Audio-Driven Co-Speech Gesture Video Generation

Authors: Xian Liu, Qianyi Wu, Hang Zhou, Yuanqi Du, Wayne Wu, Dahua Lin, Ziwei Liu

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture video. Demo video and more resources can be found in: https://alvinliu0.github.io/projects/ANGIE
Researcher Affiliation | Collaboration | Xian Liu (1), Qianyi Wu (2), Hang Zhou (1), Yuanqi Du (3), Wayne Wu (4), Dahua Lin (1,4), Ziwei Liu (5). Affiliations: (1) Multimedia Laboratory, The Chinese University of Hong Kong; (2) Monash University; (3) Cornell University; (4) Shanghai AI Laboratory; (5) S-Lab, Nanyang Technological University
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. It provides figures and mathematical equations but no algorithm listings.
Open Source Code | No | Besides, although the code and data are not included, as promised we will make the code, models and data publicly available.
Open Datasets | Yes | Pose, Audio, Transcript, Style (PATS) is a large-scale dataset of 25 speakers with aligned pose, audio and transcripts [1, 2, 21].
Dataset Splits | Yes | We randomly split the segments into 90% for training and 10% for evaluation. (A minimal split sketch is given after this table.)
Hardware Specification | Yes | The overall framework is implemented in PyTorch [37] and trained on one 16G Tesla V100 GPU for three days.
Software Dependencies | No | The paper mentions software like PyTorch, Librosa, and OpenPose, but does not specify their version numbers (e.g., "implemented in PyTorch [37]", "audio MFCC features a_mfcc ∈ R^{28×12} are extracted by Librosa", "2D skeletons of the image frames are obtained by OpenPose [9]").
Experiment Setup | Yes | We sample T = 96 frame clips with stride 32 for training. ... the co-speech gesture pattern codebook size M for both relative shift-translation µ and factorial covariance change L are set to 512. ... The channel dimension ℓ of each codebook entry d^µ, d^L as well as the encoded latent features e^µ, e^L are 512, while the latent temporal dimension T′ is set as T/8 = 12 ... The ϵ is set as 1 × 10^-5 ... The commit loss trade-offs in L_VQ are empirically set as β1 = β2 = 0.1. We optimize the gesture pattern VQ-VAE with Adam optimizer [28] of learning rate 3 × 10^-5. ... the Transformer channel dimension is 768, and the attention layer is implemented in 12 heads with dropout probability of 0.1. (A configuration sketch collecting these values is given after this table.)
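
The split quoted in the Dataset Splits row is a plain random 90/10 partition of PATS segments. Below is a minimal sketch of such a segment-level split, assuming integer segment IDs, a fixed seed, and a helper name chosen for illustration; it is not taken from the authors' code.

```python
import random

def split_segments(segment_ids, train_ratio=0.9, seed=0):
    """Randomly split segment IDs into train/eval subsets (90%/10% by default)."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    ids = list(segment_ids)
    rng.shuffle(ids)                   # in-place random permutation
    cut = int(len(ids) * train_ratio)  # boundary between train and eval
    return ids[:cut], ids[cut:]

# Illustrative usage with 1,000 dummy segment IDs: yields 900 train / 100 eval.
train_ids, eval_ids = split_segments(range(1000))
print(len(train_ids), len(eval_ids))
```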
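For readability, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object, as in the hedged sketch below. The class and field names are assumptions made for illustration; only the numeric values come from the paper's quoted text.

```python
from dataclasses import dataclass

@dataclass
class GestureTrainConfig:
    # clip sampling
    clip_len: int = 96           # T: frames per training clip
    clip_stride: int = 32        # sliding-window stride between clips
    # gesture pattern VQ-VAE (shared by the µ and L codebooks)
    codebook_size: int = 512     # M entries per codebook
    codebook_dim: int = 512      # channel dim ℓ of entries d^µ, d^L and latents e^µ, e^L
    latent_len: int = 12         # T' = T / 8
    eps: float = 1e-5            # ϵ value reported in the paper
    beta1: float = 0.1           # commit-loss trade-off for the µ codebook
    beta2: float = 0.1           # commit-loss trade-off for the L codebook
    vqvae_lr: float = 3e-5       # Adam learning rate for the VQ-VAE
    # Transformer predictor
    transformer_dim: int = 768
    num_heads: int = 12
    dropout: float = 0.1

cfg = GestureTrainConfig()
assert cfg.latent_len == cfg.clip_len // 8  # consistency check: T' = T/8 = 12
# The reported optimizer setting would correspond to something like
#   torch.optim.Adam(vqvae.parameters(), lr=cfg.vqvae_lr)
# where `vqvae` is the gesture pattern VQ-VAE model (not defined here).
```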