Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Audio-Driven Co-Speech Gesture Video Generation
Authors: Xian Liu, Qianyi Wu, Hang Zhou, Yuanqi Du, Wayne Wu, Dahua Lin, Ziwei Liu
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture video. Demo video and more resources can be found in: https://alvinliu0.github.io/projects/ANGIE |
| Researcher Affiliation | Collaboration | Xian Liu¹, Qianyi Wu², Hang Zhou¹, Yuanqi Du³, Wayne Wu⁴, Dahua Lin¹,⁴, Ziwei Liu⁵; ¹Multimedia Laboratory, The Chinese University of Hong Kong; ²Monash University; ³Cornell University; ⁴Shanghai AI Laboratory; ⁵S-Lab, Nanyang Technological University |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. It provides figures and mathematical equations but no algorithm listings. |
| Open Source Code | No | Besides, although the code and data are not included, as promised we will make the code, models and data publicly available. |
| Open Datasets | Yes | Pose, Audio, Transcript, Style (PATS) is a large-scale dataset of 25 speakers with aligned pose, audio and transcripts [1, 2, 21]. |
| Dataset Splits | Yes | We randomly split the segments into 90% for training and 10% for evaluation. |
| Hardware Specification | Yes | The overall framework is implemented in PyTorch [37] and trained on one 16G Tesla V100 GPU for three days. |
| Software Dependencies | No | The paper mentions software like PyTorch, Librosa, and OpenPose, but does not specify their version numbers (e.g., "implemented in PyTorch [37]", "audio mfcc features $a^{mfcc} \in \mathbb{R}^{28 \times 12}$ are extracted by Librosa", "2D skeletons of the image frames are obtained by OpenPose [9]"). |
| Experiment Setup | Yes | We sample $T = 96$ frame clips with stride 32 for training. ... the co-speech gesture pattern codebook size $M$ for both relative shift-translation $\mu$ and factorial covariance change $L$ are set to 512. ... The channel dimension $\ell$ of each codebook entry $d_{\mu}, d_{L}$ as well as the encoded latent features $e_{\mu}, e_{L}$ are 512, while the temporal dimension $T'$ is set as $T/8 = 12$ ... The $\epsilon$ is set as $1 \times 10^{-5}$ ... The commit loss trade-offs in $\mathcal{L}_{VQ}$ are empirically set as $\beta_1 = \beta_2 = 0.1$. We optimize the gesture pattern VQ-VAE with Adam optimizer [28] of learning rate $3 \times 10^{-5}$. ... the Transformer channel dimension is 768, and the attention layer is implemented in 12 heads with dropout probability of 0.1. |
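
The reported split and setup values can be collected into a small Python sketch. This is not the authors' code, which was not released with the paper: every class, field, and function name below is hypothetical, the fixed seed is an assumption, and the values for ϵ and the learning rate assume the exponents in the extracted text are negative (1e-5 and 3e-5). The paper also does not say whether the 90/10 split is made before or after clips are cut, so the two steps are kept separate here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupConfig:
    """Hyperparameters quoted in the table above; field names are hypothetical."""
    clip_len: int = 96            # T, frames per training clip
    stride: int = 32              # step between consecutive clip starts
    codebook_size: int = 512      # M, entries per gesture pattern codebook
    channel_dim: int = 512        # channel dimension of entries and latents
    temporal_downsample: int = 8  # latent length is T / 8
    epsilon: float = 1e-5         # assumed negative exponent
    commit_beta: float = 0.1      # beta_1 = beta_2
    vqvae_lr: float = 3e-5        # assumed negative exponent, Adam optimizer
    transformer_dim: int = 768
    attention_heads: int = 12
    dropout: float = 0.1
    train_fraction: float = 0.9   # 90% training, 10% evaluation


def clip_starts(num_frames, clip_len, stride):
    """Return the start index of every full-length clip in one sequence."""
    if num_frames < clip_len:
        return []
    return list(range(0, num_frames - clip_len + 1, stride))


def split_segments(segments, train_fraction, seed=0):
    """Randomly split segments into training and evaluation lists."""
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)  # fixed seed is an assumption
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


cfg = SetupConfig()
starts = clip_starts(256, cfg.clip_len, cfg.stride)
train, evaluation = split_segments(range(100), cfg.train_fraction)
latent_len = cfg.clip_len // cfg.temporal_downsample

print(starts)
print(len(train), len(evaluation), latent_len)
```

With a hypothetical 256-frame sequence, the sliding window yields six clips, and the latent length works out to the 12 steps quoted in the table.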