MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models

Authors: Zunnan Xu, Yukang Lin, Haonan Han, Sicheng Yang, Ronghui Li, Yachao Zhang, Xiu Li

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Subjective and objective experiments demonstrate that our method surpasses the performance of state-of-the-art models. (Abstract); Extensive experiments and analyses demonstrate the effectiveness of our proposed method. (Section 1, Contributions)
Researcher Affiliation | Collaboration | Zunnan Xu, Yukang Lin, Haonan Han, Sicheng Yang, Ronghui Li, Yachao Zhang, Xiu Li; Shenzhen International Graduate School, Tsinghua University, University Town of Shenzhen, Nanshan District, Shenzhen, Guangdong, P.R. China; Work done as intern at Tencent.
Pseudocode | Yes | The local scanning procedure is illustrated in Algorithm 1. (Appendix A.5); Algorithm 1 Local Scanning Process (Appendix A.5)
Open Source Code | Yes | Our project is publicly available at https://kkakkkka.github.io/MambaTalk/. (Abstract)
Open Datasets | Yes | We train and evaluate on the BEAT2 dataset proposed by [32]. (Section 4.1); To evaluate the generalisable benefit of our method, we conduct experiments on a large-scale multimodal dataset known as BEAT (Body-Expression-Audio-Text) [33]. (Appendix A.3)
Dataset Splits | Yes | We split datasets into 85%/7.5%/7.5% for the train/val/test set. (Section 4.1); Furthermore, we implement the conventional approach of partitioning the dataset into distinct training, validation, and testing subsets, ensuring consistency with the data partitioning scheme utilized in prior research to uphold the integrity of the comparison. (Appendix A.3)
Hardware Specification | Yes | All experiments are conducted using one NVIDIA A100 GPU. (Section 4.2); We measured the runtime of various components on the NVIDIA A100 GPU in our method... (Appendix A.2)
Software Dependencies | No | The paper mentions the Adam optimizer and components such as VQ-VAEs, but it does not list the software libraries or version numbers needed to reproduce the training environment.
Experiment Setup | Yes | We utilize the Adam optimizer with a learning rate of 2.5e-4. To maintain stability, we apply gradient norm clipping at a value of 0.99. In the construction of the VQVAEs, we employ a uniform initialization for the codebook, setting the codebook entries to feature lengths of 512 and establishing the codebook size at 256. The numerical distribution range for the codebook initialization is defined as [-1/codebook_size, 1/codebook_size). ... The VQVAEs are trained for 200 epochs, with a learning rate of 2.5e-4 for the first 195 epochs, which is then reduced to 2.5e-5 for the final 5 epochs. During the second stage, the model is trained for 100 epochs. (Section 4.2); The total loss L is a weighted sum of the categorical and latent reconstruction losses, with α and β serving as balance hyper-parameters: L = αL_cls + βL_rec^latent, where α = 1 and β = 3 for hands, upper and lower body motion. For facial motion, we set α = 0 and β = 3. (Section 3.3) Hedged code sketches of this setup follow the table.
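
The quoted VQ-VAE setup can be made concrete with a short sketch. This is a minimal illustration, not the authors' released code: the `Codebook` module, the nearest-neighbour lookup, and the toy training step are assumptions; only the numeric values (codebook size 256, feature length 512, uniform initialization in [-1/codebook_size, 1/codebook_size), Adam with learning rate 2.5e-4, gradient-norm clipping at 0.99) come from Section 4.2 as quoted above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

CODEBOOK_SIZE = 256  # codebook size (Section 4.2)
FEATURE_DIM = 512    # feature length of each codebook entry (Section 4.2)

class Codebook(nn.Module):
    """Minimal VQ codebook illustrating the initialization described in the paper."""

    def __init__(self, num_codes: int = CODEBOOK_SIZE, dim: int = FEATURE_DIM):
        super().__init__()
        self.embedding = nn.Embedding(num_codes, dim)
        # Uniform initialization in [-1/codebook_size, 1/codebook_size), per Section 4.2.
        self.embedding.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z: torch.Tensor):
        # Nearest-neighbour lookup: z has shape (batch, dim).
        dists = torch.cdist(z, self.embedding.weight)  # (batch, num_codes)
        indices = dists.argmin(dim=-1)                 # (batch,)
        return self.embedding(indices), indices

# Schematic single training step; the MSE loss here is a stand-in, not the paper's objective.
model = Codebook()
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)  # learning rate from Section 4.2

z = torch.randn(4, FEATURE_DIM)
quantized, _ = model(z)
loss = F.mse_loss(quantized, z)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.99)  # clipping value from Section 4.2
optimizer.step()
```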
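The weighted loss quoted from Section 3.3 can likewise be written out directly. The helper below is an illustrative assumption (the function name and the `part` argument are not from the paper); the α/β values follow the quote.

```python
import torch

def total_loss(loss_cls: torch.Tensor,
               loss_rec_latent: torch.Tensor,
               part: str) -> torch.Tensor:
    """L = alpha * L_cls + beta * L_rec^latent (Section 3.3).

    alpha = 1, beta = 3 for hands, upper- and lower-body motion;
    alpha = 0, beta = 3 for facial motion (classification term dropped).
    """
    alpha = 0.0 if part == "face" else 1.0
    beta = 3.0
    return alpha * loss_cls + beta * loss_rec_latent

# Example: facial motion keeps only the latent reconstruction term.
face_loss = total_loss(torch.tensor(0.7), torch.tensor(0.2), part="face")  # -> 0.6
```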