Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling
Authors: Rui Liu, Yifan Hu, Yi Ren, Xiang Yin, Haizhou Li
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Both objective and subjective evaluations suggest that our model outperforms the baseline models in understanding and rendering emotions. |
| Researcher Affiliation | Collaboration | 1 Inner Mongolia University, China; 2 ByteDance; 3 Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen, China; 4 National University of Singapore, Singapore |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | Code and audio samples can be found at: https://github.com/walkerhyf/ECSS. |
| Open Datasets | Yes | We validate the ECSS on a recently public dataset for conversational speech synthesis called DailyTalk (Lee, Park, and Kim 2023). |
| Dataset Splits | Yes | We partition the data into training, validation, and test sets at a ratio of 8:1:1. |
| Hardware Specification | Yes | The model is trained on a Tesla V100 GPU with a batch size of 16 and 600k steps. |
| Software Dependencies | No | The paper mentions using pre-trained models like BERT and HiFi-GAN, and a G2P toolkit, but does not provide specific version numbers for these software components or any programming language and library versions (e.g., Python, PyTorch). |
| Experiment Setup | Yes | In the heterogeneous graph-based emotion context encoder, the dimension of the text node representation $f_{u_j}$ is set to 512, and the dimensions of the remaining type node representations $f_{e_j}$, $f_{i_j}$, $f_{s_j}$, and $f_{a_j}$ are all set to 256. For multi-head attention-based methods, we set the head number as 8. ... The model is trained on a Tesla V100 GPU with a batch size of 16 and 600k steps. ... More detailed experimental settings are accessed in the Appendix section. |
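
The 8:1:1 partition reported under Dataset Splits can be illustrated with a minimal sketch. The quoted text does not say whether the split is made per dialogue or per utterance, nor which random seed was used, so the function name `split_8_1_1`, the seed, and the placeholder dialogue IDs below are assumptions for illustration only, not the authors' procedure.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle items and split them into train/validation/test at 8:1:1.

    The seed and the shuffle-then-slice strategy are assumptions; the paper
    only states the ratio.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical dialogue IDs standing in for the real dataset entries.
dialogue_ids = list(range(100))
train_ids, val_ids, test_ids = split_8_1_1(dialogue_ids)
print(len(train_ids), len(val_ids), len(test_ids))  # 80 10 10
```

Splitting at the dialogue level, as assumed here, keeps every utterance of a conversation in the same subset, which avoids leaking conversational context between training and test data.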