Chain of Generation: Multi-Modal Gesture Synthesis via Cascaded Conditional Control
Authors: Zunnan Xu, Yachao Zhang, Sicheng Yang, Ronghui Li, Xiu Li
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments and user studies confirm that our proposed approach achieves state-of-the-art performance. |
| Researcher Affiliation | Academia | Tsinghua Shenzhen International Graduate School, Tsinghua University, University Town of Shenzhen, Nanshan District, Shenzhen, Guangdong, P.R. China |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | We conducted comprehensive experiments on a large-scale multimodal dataset called BEAT (Body-Expression-Audio-Text) (Liu et al. 2022a). |
| Dataset Splits | Yes | Additionally, we followed the established practice of dividing the dataset into separate training, validation, and testing subsets, while maintaining the same data partitioning scheme as in previous work to ensure the fairness of the comparison. |
| Hardware Specification | Yes | All experiments are conducted using NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions tools such as fastText and wav2vec 2.0 but does not provide specific version numbers for the software dependencies needed for replication (see the wav2vec 2.0 loading sketch below the table). |
| Experiment Setup | Yes | We use the Adam optimizer with an initial learning rate of 0.00025, and set the batch size to 512. To ensure a fair comparison, we use N = 34 frame clips with a stride of 10 during training. The initial four frames are used as seed poses, and the model is trained to generate the remaining 30 poses, which correspond to a duration of 2 seconds. Our models utilize 47 joints in the BEAT dataset, including 38 hand joints and 9 body joints. The latent dimensions of the facial blendshape, audio, text, and gesture features are all set to 128, while the speaker embedding and emotion embedding are set to 8. We set τ = 0.1 in the rhythmic identification loss. |
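
The Experiment Setup row gives enough hyperparameters to reconstruct the training configuration at a glance. Below is a minimal sketch that collects those values into a single Python config; the `TrainConfig` dataclass and its field names are illustrative, and only the numeric values come from the paper.

```python
# Minimal, self-contained sketch of the reported training configuration.
# Only the numeric values are taken from the paper; the dataclass itself
# and its field names are illustrative.
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Optimization
    optimizer: str = "Adam"
    learning_rate: float = 2.5e-4    # initial learning rate of 0.00025
    batch_size: int = 512

    # Clip construction during training
    clip_frames: int = 34            # N = 34 frame clips
    stride: int = 10                 # sliding-window stride
    seed_frames: int = 4             # first four frames used as seed poses
    generated_frames: int = 30       # remaining 30 poses (2 seconds)

    # Skeleton (BEAT)
    num_joints: int = 47             # 38 hand joints + 9 body joints

    # Feature dimensions
    latent_dim: int = 128            # facial blendshape / audio / text / gesture
    speaker_embed_dim: int = 8
    emotion_embed_dim: int = 8

    # Losses
    tau_rhythm: float = 0.1          # τ in the rhythmic identification loss


if __name__ == "__main__":
    print(TrainConfig())
```

Note that `generated_frames = clip_frames - seed_frames`, consistent with the 34-frame clips described in the paper.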
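
The Software Dependencies row notes that the paper names fastText and wav2vec 2.0 but pins no versions. One plausible way to make the audio front end concrete is sketched below using the HuggingFace `transformers` library; the library choice and the `facebook/wav2vec2-base-960h` checkpoint are assumptions, not details stated in the paper.

```python
# Hedged sketch: extracting wav2vec 2.0 audio features via HuggingFace
# transformers. The checkpoint and library are assumptions; the paper only
# names wav2vec 2.0 without specifying an implementation or version.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINT = "facebook/wav2vec2-base-960h"  # assumed checkpoint

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = Wav2Vec2Model.from_pretrained(CHECKPOINT).eval()


def audio_features(waveform_16khz):
    """Return frame-level wav2vec 2.0 features for a mono 16 kHz waveform."""
    inputs = feature_extractor(
        waveform_16khz, sampling_rate=16000, return_tensors="pt"
    )
    with torch.no_grad():
        hidden = model(inputs["input_values"]).last_hidden_state  # (1, T, 768)
    return hidden.squeeze(0)


# Example: two seconds of silence at 16 kHz
print(audio_features([0.0] * 32000).shape)
```

In a setup like the one described, these 768-dimensional features would presumably be projected down to the 128-dimensional audio latent reported in the Experiment Setup row.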