Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation
Authors: Hui Fu, Zeqing Wang, Ke Gong, Keze Wang, Tianshui Chen, Haojie Li, Haifeng Zeng, Wenxiong Kang
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive qualitative and quantitative experiments conducted on three publicly available datasets demonstrate that our approach outperforms state-of-the-art methods and is capable of capturing diverse speaking styles for speech-driven 3D facial animation. |
| Researcher Affiliation | Collaboration | Hui Fu (1,3), Zeqing Wang (2), Ke Gong (3), Keze Wang (2), Tianshui Chen (4), Haojie Li (3), Haifeng Zeng (3), Wenxiong Kang (1)*; affiliations: (1) South China University of Technology, (2) Sun Yat-sen University, (3) X-ERA.ai, (4) Guangdong University of Technology |
| Pseudocode | No | The paper describes its method and architecture using text and diagrams (e.g., Figure 2) but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code and supplementary video are publicly available at: https://zeqingwang.github.io/Mimic/ |
| Open Datasets | Yes | We construct a larger dataset called 3D-HDTF based on a high-quality 2D audio-visual dataset HDTF (Zhang et al. 2021). We perform quantitative evaluations on 3D-HDTF-Test-A and B, and qualitative evaluations on 3D-HDTF-Test-A and B, VOCA-Test, and BIWI-Test-B. For VOCASET and BIWI, we follow the settings in CodeTalker, illustrated in the supplementary materials. Previous works (Xing et al. 2023) usually employ VOCASET (Cudeiro et al. 2019) and BIWI (Fanelli et al. 2010), which are limited in identities and richness of speech content. |
| Dataset Splits | No | The paper states: 'We use 172 sequences of 150 identities for training and conduct two testing sets: Test-A contains 38 sequences of 38 seen identities; Test-B contains 10 sequences of 10 unseen identities'. However, it does not specify a separate validation set or the percentages for the training, validation, and test splits relative to the total dataset size, nor does it explicitly mention the methodology for generating these splits from the full dataset. |
| Hardware Specification | Yes | Our framework is implemented by Pytorch, trained on a single RTX 4090 GPU for 150 epochs. |
| Software Dependencies | No | The paper states: 'Our framework is implemented by Pytorch' and mentions 'wav2vec 2.0', but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We use a 6-second window size for both the input sequence during training and the style reference sequence during inference. We use Adam optimizer with a learning rate of 0.0001 for training. The batch size is set to 6 for 3D-HDTF and 1 for both VOCASET and BIWI. Our framework is implemented by Pytorch, trained on a single RTX 4090 GPU for 150 epochs. Our full objective (Eq. 10) can be written as $\mathcal{L} = \lambda_r \mathcal{L}_r + \lambda_s \mathcal{L}_s + \lambda_c \mathcal{L}_c + \lambda_{con} \mathcal{L}_{con} + \lambda_{s\text{-}cycle} \mathcal{L}_{s\text{-}cycle} + \lambda_{c\text{-}cycle} \mathcal{L}_{c\text{-}cycle}$, where $\lambda_r = 1$, $\lambda_s = 2.5\times10^{-7}$, $\lambda_c = 5.0\times10^{-7}$, $\lambda_{con} = 5.0\times10^{-7}$, $\lambda_{s\text{-}cycle} = 2.5\times10^{-5}$, and $\lambda_{c\text{-}cycle} = 5.0\times10^{-6}$. |
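
For context, the sketch below combines the reported hyperparameters (Adam, learning rate 1e-4, 150 epochs) and the loss weights from Eq. (10) into a minimal PyTorch training skeleton. It is not the authors' implementation: the negative exponents on the weights are assumed (the extracted text dropped the exponent signs), and `model.compute_losses` is a hypothetical helper standing in for the paper's six loss terms.

```python
import torch

# Loss weights from Eq. (10); the negative exponents are an assumption,
# since the extracted text dropped the exponent signs.
LAMBDA = {
    "r": 1.0,           # reconstruction
    "s": 2.5e-7,        # style
    "c": 5.0e-7,        # content
    "con": 5.0e-7,      # contrastive
    "s_cycle": 2.5e-5,  # style cycle
    "c_cycle": 5.0e-6,  # content cycle
}


def total_loss(losses):
    """Weighted sum of the six component losses.

    `losses` maps each term name ("r", "s", "c", "con", "s_cycle",
    "c_cycle") to an already-computed scalar tensor; how each term is
    computed is defined in the paper and not reproduced here.
    """
    return sum(LAMBDA[name] * value for name, value in losses.items())


def train(model, dataloader, epochs=150, lr=1e-4, device="cuda"):
    """Training skeleton with the reported optimizer settings.

    `model.compute_losses` is a hypothetical helper that would return the
    six loss terms for a batch; it is not part of the released code.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for _ in range(epochs):
        for batch in dataloader:
            losses = model.compute_losses(batch)
            loss = total_loss(losses)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

The batch size reported in the paper (6 for 3D-HDTF, 1 for VOCASET and BIWI) would be set when constructing `dataloader`.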