Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Chain of Generation: Multi-Modal Gesture Synthesis via Cascaded Conditional Control
Authors: Zunnan Xu, Yachao Zhang, Sicheng Yang, Ronghui Li, Xiu Li
AAAI 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments and user studies confirm that our proposed approach achieves state-of-the-art performance. |
| Researcher Affiliation | Academia | Tsinghua Shenzhen International Graduate School, Tsinghua University, University Town of Shenzhen, Nanshan District, Shenzhen, Guangdong, P.R. China |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | We conducted comprehensive experiments on a large-scale multimodal dataset called BEAT (Body Expression-Audio-Text) (Liu et al. 2022a). |
| Dataset Splits | Yes | Additionally, we followed the established practice of dividing the dataset into separate training, validation, and testing subsets, while maintaining the same data partitioning scheme as in previous work to ensure the fairness of the comparison. |
| Hardware Specification | Yes | All experiments are conducted using NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions tools like fastText and wav2vec 2.0 but does not provide specific version numbers for software dependencies needed for replication. |
| Experiment Setup | Yes | We use the Adam optimizer with an initial learning rate of 0.00025, and set the batch size to 512. To ensure a fair comparison, we use N = 34 frame clips with a stride of 10 during training. The initial four frames are used as seed poses, and the model is trained to generate the remaining 30 poses, which correspond to a duration of 2 seconds. Our models utilize 47 joints in the BEAT dataset, including 38 hand joints and 9 body joints. The latent dimensions of the facial blendshape, audio, text, and gesture features are all set to 128, while the speaker embedding and emotion embedding are set to 8. We set τ = 0.1 in the rhythmic identification loss. |
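The clip-segmentation scheme quoted above (34-frame windows with a stride of 10, where the first 4 frames are seed poses and the remaining 30 are generation targets) can be sketched in plain Python. This is an illustrative sketch only; the function name and the list-of-dicts representation are assumptions, not taken from the paper.

```python
def make_clips(frames, clip_len=34, stride=10, seed_len=4):
    """Slice a pose sequence into fixed-length training clips.

    Each clip holds `clip_len` frames; the first `seed_len` frames act as
    seed poses and the remaining frames are the generation targets
    (30 frames, i.e. 2 seconds at 15 fps, per the reported setup).
    """
    clips = []
    for start in range(0, len(frames) - clip_len + 1, stride):
        window = frames[start:start + clip_len]
        clips.append({"seed": window[:seed_len], "target": window[seed_len:]})
    return clips

# Example: a 60-frame sequence yields clips starting at frames 0, 10, 20.
sequence = list(range(60))
clips = make_clips(sequence)
print(len(clips))               # 3 clips
print(len(clips[0]["seed"]))    # 4 seed frames
print(len(clips[0]["target"]))  # 30 target frames
```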