MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models
Authors: Zunnan Xu, Yukang Lin, Haonan Han, Sicheng Yang, Ronghui Li, Yachao Zhang, Xiu Li
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Subjective and objective experiments demonstrate that our method surpasses the performance of state-of-the-art models. (Abstract); Extensive experiments and analyses demonstrate the effectiveness of our proposed method. (Section 1, Contributions) |
| Researcher Affiliation | Collaboration | Zunnan Xu, Yukang Lin, Haonan Han, Sicheng Yang, Ronghui Li, Yachao Zhang, Xiu Li. Shenzhen International Graduate School, Tsinghua University, University Town of Shenzhen, Nanshan District, Shenzhen, Guangdong, P.R. China; Work done as intern at Tencent. |
| Pseudocode | Yes | The local scanning procedure is illustrated in Algorithm 1. (Appendix A.5); Algorithm 1 Local Scanning Process (Appendix A.5) |
| Open Source Code | Yes | Our project is publicly available at https://kkakkkka.github.io/MambaTalk/. (Abstract) |
| Open Datasets | Yes | We train and evaluate on the BEAT2 dataset proposed by [32]. (Section 4.1); To evaluate the generalisable benefit of our method, we conduct experiments on a large-scale multimodal dataset known as BEAT (Body-Expression-Audio-Text) [33]. (Appendix A.3) |
| Dataset Splits | Yes | We split datasets into 85%/7.5%/7.5% for the train/val/test set. (Section 4.1); Furthermore, we implement the conventional approach of partitioning the dataset into distinct training, validation, and testing subsets, ensuring consistency with the data partitioning scheme utilized in prior research to uphold the integrity of the comparison. (Appendix A.3) See the split sketch following this table. |
| Hardware Specification | Yes | All experiments are conducted using one NVIDIA A100 GPU. (Section 4.2); We measured the runtime of various components on the NVIDIA A100 GPU in our method... (Appendix A.2) |
| Software Dependencies | No | The paper mentions using the Adam optimizer and model components such as VQ-VAEs, but does not name the software libraries or version numbers required to ensure reproducibility. |
| Experiment Setup | Yes | We utilize the Adam optimizer with a learning rate of 2.5e-4. To maintain stability, we apply gradient norm clipping at a value of 0.99. In the construction of the VQVAEs, we employ a uniform initialization for the codebook, setting the codebook entries to feature lengths of 512 and establishing the codebook size at 256. The numerical distribution range for the codebook initialization is defined as [-1/codebook_size, 1/codebook_size). ... The VQVAEs are trained for 200 epochs, with a learning rate of 2.5e-4 for the first 195 epochs, which is then reduced to 2.5e-5 for the final 5 epochs. During the second stage, the model is trained for 100 epochs. (Section 4.2); The total loss L is a weighted sum of the categorical and latent reconstruction losses, with α and β serving as balance hyper-parameters: L = α·L_cls + β·L_rec^latent, where α = 1 and β = 3 for hands, upper and lower body motion. For facial motion, we set α = 0 and β = 3. (Section 3.3) See the configuration sketches following this table. |
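The Dataset Splits row quotes an 85%/7.5%/7.5% train/val/test partition. The minimal sketch below reproduces those ratios over a list of sequence IDs; the `split_dataset` helper and the deterministic ordering are assumptions for illustration, since the paper only states the ratios.

```python
# Hypothetical helper reproducing the quoted 85% / 7.5% / 7.5% split.
def split_dataset(ids, train_frac=0.85, val_frac=0.075):
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

train_ids, val_ids, test_ids = split_dataset(list(range(1000)))
print(len(train_ids), len(val_ids), len(test_ids))  # 850 75 75
```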
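The Experiment Setup row specifies a uniform codebook initialization over [-1/codebook_size, 1/codebook_size) with 256 entries of feature length 512, and a weighted loss L = α·L_cls + β·L_rec^latent. A minimal PyTorch sketch of both is below; the `nn.Embedding` container and the `total_loss` helper are assumptions, as the paper does not name the modules used.

```python
import torch.nn as nn

CODEBOOK_SIZE = 256  # codebook size (Section 4.2)
CODE_DIM = 512       # feature length of each codebook entry (Section 4.2)

# Uniform codebook initialization over [-1/codebook_size, 1/codebook_size).
codebook = nn.Embedding(CODEBOOK_SIZE, CODE_DIM)
codebook.weight.data.uniform_(-1.0 / CODEBOOK_SIZE, 1.0 / CODEBOOK_SIZE)

# Weighted total loss (Section 3.3): alpha = 1, beta = 3 for hands, upper
# and lower body motion; alpha = 0, beta = 3 for facial motion.
def total_loss(l_cls, l_rec_latent, part):
    alpha, beta = (0.0, 3.0) if part == "face" else (1.0, 3.0)
    return alpha * l_cls + beta * l_rec_latent
```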
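The same row gives the first-stage optimization schedule: Adam at 2.5e-4 for 195 epochs, reduced to 2.5e-5 for the final 5 epochs, with gradient-norm clipping at 0.99. The sketch below shows one way to wire that up in PyTorch; the stand-in model, data, and loss are placeholders, not the paper's architecture.

```python
import torch

model = torch.nn.Linear(512, 512)  # placeholder for the VQ-VAE
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)

for epoch in range(200):
    if epoch == 195:  # reduce LR to 2.5e-5 for the final 5 epochs
        for group in optimizer.param_groups:
            group["lr"] = 2.5e-5
    for batch in [torch.randn(8, 512)]:    # placeholder data loader
        loss = model(batch).pow(2).mean()  # placeholder loss
        optimizer.zero_grad()
        loss.backward()
        # Gradient norm clipping at 0.99, as quoted from Section 4.2.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.99)
        optimizer.step()
```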