Chain of Generation: Multi-Modal Gesture Synthesis via Cascaded Conditional Control
Authors: Zunnan Xu, Yachao Zhang, Sicheng Yang, Ronghui Li, Xiu Li
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments and user studies confirm that our proposed approach achieves state-of-the-art performance. |
| Researcher Affiliation | Academia | Tsinghua Shenzhen International Graduate School, Tsinghua University, University Town of Shenzhen, Nanshan District, Shenzhen, Guangdong, P.R. China |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | We conducted comprehensive experiments on a large-scale multimodal dataset called BEAT (Body-Expression-Audio-Text) (Liu et al. 2022a). |
| Dataset Splits | Yes | Additionally, we followed the established practice of dividing the dataset into separate training, validation, and testing subsets, while maintaining the same data partitioning scheme as in previous work to ensure the fairness of the comparison. |
| Hardware Specification | Yes | All experiments are conducted using NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions tools such as fastText and wav2vec 2.0 but does not provide specific version numbers for the software dependencies needed for replication (see the wav2vec 2.0 loading sketch below the table). |
| Experiment Setup | Yes | We use the Adam optimizer with an initial learning rate of 0.00025, and set the batch size to 512. To ensure a fair comparison, we use N = 34 frame clips with a stride of 10 during training. The initial four frames are used as seed poses, and the model is trained to generate the remaining 30 poses, which correspond to a duration of 2 seconds. Our models utilize 47 joints in the BEAT dataset, including 38 hand joints and 9 body joints. The latent dimensions of the facial blendshape, audio, text, and gesture features are all set to 128, while the speaker embedding and emotion embedding are set to 8. We set τ = 0.1 in the rhythmic identification loss. |
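
The Experiment Setup row gives enough hyperparameters to reconstruct the training configuration at a glance. Below is a minimal sketch that collects those values into a single Python config; the `TrainConfig` dataclass and its field names are illustrative, and only the numeric values come from the paper.

```python
# Minimal, self-contained sketch of the reported training configuration.
# Only the numeric values are taken from the paper; the dataclass itself
# and its field names are illustrative.
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Optimization
    optimizer: str = "Adam"
    learning_rate: float = 2.5e-4    # initial learning rate of 0.00025
    batch_size: int = 512

    # Clip construction during training
    clip_frames: int = 34            # N = 34 frame clips
    stride: int = 10                 # sliding-window stride
    seed_frames: int = 4             # first four frames used as seed poses
    generated_frames: int = 30       # remaining 30 poses (2 seconds)

    # Skeleton (BEAT)
    num_joints: int = 47             # 38 hand joints + 9 body joints

    # Feature dimensions
    latent_dim: int = 128            # facial blendshape / audio / text / gesture
    speaker_embed_dim: int = 8
    emotion_embed_dim: int = 8

    # Losses
    tau_rhythm: float = 0.1          # τ in the rhythmic identification loss


if __name__ == "__main__":
    print(TrainConfig())
```

Note that `generated_frames = clip_frames - seed_frames`, consistent with the 34-frame clips described in the paper.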
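
The Software Dependencies row notes that the paper names fastText and wav2vec 2.0 but pins no versions. One plausible way to make the audio front end concrete is sketched below using the HuggingFace `transformers` library; the library choice and the `facebook/wav2vec2-base-960h` checkpoint are assumptions, not details stated in the paper.

```python
# Hedged sketch: extracting wav2vec 2.0 audio features via HuggingFace
# transformers. The checkpoint and library are assumptions; the paper only
# names wav2vec 2.0 without specifying an implementation or version.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINT = "facebook/wav2vec2-base-960h"  # assumed checkpoint

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = Wav2Vec2Model.from_pretrained(CHECKPOINT).eval()


def audio_features(waveform_16khz):
    """Return frame-level wav2vec 2.0 features for a mono 16 kHz waveform."""
    inputs = feature_extractor(
        waveform_16khz, sampling_rate=16000, return_tensors="pt"
    )
    with torch.no_grad():
        hidden = model(inputs["input_values"]).last_hidden_state  # (1, T, 768)
    return hidden.squeeze(0)


# Example: two seconds of silence at 16 kHz
print(audio_features([0.0] * 32000).shape)
```

In a setup like the one described, these 768-dimensional features would presumably be projected down to the 128-dimensional audio latent reported in the Experiment Setup row.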