Audio-Driven Co-Speech Gesture Video Generation
Authors: Xian Liu, Qianyi Wu, Hang Zhou, Yuanqi Du, Wayne Wu, Dahua Lin, Ziwei Liu
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture video. Demo video and more resources can be found in: https://alvinliu0.github.io/projects/ANGIE |
| Researcher Affiliation | Collaboration | Xian Liu1, Qianyi Wu2, Hang Zhou1, Yuanqi Du3, Wayne Wu4, Dahua Lin1,4, Ziwei Liu5 1Multimedia Laboratory, The Chinese University of Hong Kong 2Monash University 3Cornell University 4Shanghai AI Laboratory 5S-Lab, Nanyang Technological University |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. It provides figures and mathematical equations but no algorithm listings. |
| Open Source Code | No | Besides, although the code and data are not included, as promised we will make the code, models and data publicly available. |
| Open Datasets | Yes | Pose, Audio, Transcript, Style (PATS) is a large-scale dataset of 25 speakers with aligned pose, audio and transcripts [1, 2, 21]. |
| Dataset Splits | Yes | We randomly split the segments into 90% for training and 10% for evaluation. (A minimal split sketch follows the table.) |
| Hardware Specification | Yes | The overall framework is implemented in PyTorch [37] and trained on one 16GB Tesla V100 GPU for three days. |
| Software Dependencies | No | The paper mentions software like PyTorch, Librosa, and OpenPose, but does not specify their version numbers (e.g., "implemented in PyTorch [37]", "audio MFCC features $a_{\mathrm{mfcc}} \in \mathbb{R}^{28 \times 12}$ are extracted by Librosa", "2D skeletons of the image frames are obtained by OpenPose [9]"). (A hedged MFCC extraction sketch follows the table.) |
| Experiment Setup | Yes | We sample $T = 96$ frame clips with stride 32 for training. ... the co-speech gesture pattern codebook size $M$ for both relative shift-translation $\mu$ and factorial covariance change $L$ are set to 512. ... The channel dimension $\ell$ of each codebook entry $d^{\mu}$, $d^{L}$ as well as the encoded latent features $e^{\mu}$, $e^{L}$ are 512, while the temporal dimension $T'$ is set as $T/8 = 12$ ... The $\epsilon$ is set as $1 \times 10^{-5}$ ... The commit loss trade-offs in $\mathcal{L}_{VQ}$ are empirically set as $\beta_1 = \beta_2 = 0.1$. We optimize the gesture pattern VQ-VAE with Adam optimizer [28] of learning rate $3 \times 10^{-5}$. ... the Transformer channel dimension is 768, and the attention layer is implemented in 12 heads with dropout probability of 0.1. (These hyperparameters are collected in a configuration sketch after the table.) |
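
The Dataset Splits row is mechanical to reproduce. Below is a minimal sketch of the 90/10 random segment split, assuming a list of segment identifiers; the helper name and the fixed seed are illustrative additions, not details from the paper.

```python
import random

def split_segments(segments, train_frac=0.9, seed=0):
    """Randomly split PATS segments into 90% train / 10% eval.

    The paper states only the 90/10 random split; the fixed seed is
    an assumption added so this sketch is repeatable."""
    rng = random.Random(seed)
    shuffled = list(segments)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```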
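For the Software Dependencies row, the quoted MFCC shape ($28 \times 12$) suggests 12 coefficients over 28 audio frames per clip. A hedged sketch of Librosa-based MFCC extraction follows; the sample rate and hop length are assumptions, since the paper reports neither.

```python
import librosa

def extract_mfcc(wav_path, n_mfcc=12, sr=16000, hop_length=512):
    """Extract MFCC features with Librosa.

    The paper reports features of shape 28 x 12 per clip but gives no
    sampling rate or hop length; sr and hop_length here are assumed
    values chosen for illustration only."""
    audio, _ = librosa.load(wav_path, sr=sr)
    # librosa returns (n_mfcc, n_frames); transpose to (n_frames, n_mfcc)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    return mfcc.T
```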
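The Experiment Setup row lists enough hyperparameters to write down a training configuration. The sketch below collects them in PyTorch; the codebook container (`nn.Embedding`) and the standard VQ-VAE commitment-loss form are common choices assumed here, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

# Hyperparameters quoted in the Experiment Setup row.
T = 96                 # frames per training clip, sampled with stride 32
M = 512                # codebook size for both the mu and L codebooks
CODE_DIM = 512         # channel dim of codebook entries and latents e_mu, e_L
T_LATENT = T // 8      # temporal dimension after encoding: 96 / 8 = 12
EPS = 1e-5
BETA_1 = BETA_2 = 0.1  # commit-loss trade-offs in L_VQ
LR = 3e-5
D_MODEL = 768          # Transformer channel dimension
N_HEADS = 12           # attention heads
DROPOUT = 0.1

# Two codebooks: relative shift-translation (mu) and factorial covariance
# change (L). Holding them in nn.Embedding is an assumption; the paper
# does not say how the codebooks are stored.
codebook_mu = torch.nn.Embedding(M, CODE_DIM)
codebook_L = torch.nn.Embedding(M, CODE_DIM)

def commit_loss(z_e, z_q, beta):
    """Standard VQ-VAE codebook + commitment loss (van den Oord et al.),
    assumed here as the form of the commit terms in L_VQ."""
    return F.mse_loss(z_e.detach(), z_q) + beta * F.mse_loss(z_e, z_q.detach())

params = list(codebook_mu.parameters()) + list(codebook_L.parameters())
optimizer = torch.optim.Adam(params, lr=LR)
```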