Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling

Authors: Rui Liu, Yifan Hu, Yi Ren, Xiang Yin, Haizhou Li

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Both objective and subjective evaluations suggest that our model outperforms the baseline models in understanding and rendering emotions. |
| Researcher Affiliation | Collaboration | ¹Inner Mongolia University, China; ²ByteDance; ³Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen, China; ⁴National University of Singapore, Singapore |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | Code and audio samples can be found at: https://github.com/walkerhyf/ECSS |
| Open Datasets | Yes | We validate ECSS on a recently released public dataset for conversational speech synthesis called DailyTalk (Lee, Park, and Kim 2023). |
| Dataset Splits | Yes | We partition the data into training, validation, and test sets at a ratio of 8:1:1. (See the split sketch below the table.) |
| Hardware Specification | Yes | The model is trained on a Tesla V100 GPU with a batch size of 16 for 600k steps. |
| Software Dependencies | No | The paper mentions pre-trained models such as BERT and HiFi-GAN and a G2P toolkit, but provides no version numbers for these components or for the underlying language and libraries (e.g., Python, PyTorch). |
| Experiment Setup | Yes | In the heterogeneous graph-based emotion context encoder, the dimension of the text node representation f_j^u is set to 512, and the dimensions of the remaining node-type representations f_j^e, f_j^i, f_j^s, and f_j^a are all set to 256. For multi-head attention-based methods, we set the head number to 8. ... The model is trained on a Tesla V100 GPU with a batch size of 16 for 600k steps. ... More detailed experimental settings can be found in the Appendix. (See the configuration sketch below the table.) |
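
For concreteness, here is a minimal sketch of the 8:1:1 partition quoted in the Dataset Splits row. The function name, seed, and utterance-level granularity are illustrative assumptions; the paper does not publish its splitting script, nor does it state whether the split is made at the dialogue or utterance level.

```python
# Minimal sketch of an 8:1:1 train/validation/test split.
# The helper name, seed, and utterance-level granularity are
# assumptions for illustration, not the authors' actual script.
import random

def split_8_1_1(items, seed=42):
    """Shuffle and partition items into train/val/test at 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# Usage: train, val, test = split_8_1_1(dailytalk_utterance_ids)
```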
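
The dimensions quoted in the Experiment Setup row can be pictured with the following PyTorch sketch: text nodes at 512 dimensions, the remaining node types at 256, and 8-head attention over the combined context. The class name, projection layer, and tensor shapes are assumptions for illustration; the actual ECSS heterogeneous graph encoder is more elaborate than this.

```python
# Hypothetical stand-in for the heterogeneous graph-based emotion
# context encoder; only the quoted dimensions match the paper.
import torch
import torch.nn as nn

TEXT_DIM = 512    # f_j^u, the text node representation
NODE_DIM = 256    # f_j^e, f_j^i, f_j^s, f_j^a, the other node types
NUM_HEADS = 8     # head count for multi-head attention

class ContextAttention(nn.Module):
    """Project non-text nodes up to the text width, then attend."""
    def __init__(self):
        super().__init__()
        self.project = nn.Linear(NODE_DIM, TEXT_DIM)
        self.attn = nn.MultiheadAttention(TEXT_DIM, NUM_HEADS, batch_first=True)

    def forward(self, text_nodes, other_nodes):
        # text_nodes:  (batch, n_text, 512)
        # other_nodes: (batch, n_other, 256)
        context = torch.cat([text_nodes, self.project(other_nodes)], dim=1)
        out, _ = self.attn(text_nodes, context, context)  # query over full context
        return out  # (batch, n_text, 512)
```

Note that 512 divides evenly by 8, so each attention head operates on a 64-dimensional slice, which is why the quoted width and head count pair cleanly.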