Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
Authors: Tianrui Wang, Haoyu Wang, Meng Ge, Cheng Gong, Chunyu Qiang, Ziyang Ma, Zikang Huang, Guanrou Yang, Xiaobao Wang, Eng-Siong Chng, Xie Chen, Longbiao Wang, Jianwu Dang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that WeSCon effectively overcomes data scarcity, achieving state-of-the-art performance in word-level emotional expression control while preserving the strong zero-shot synthesis capabilities of the original TTS model. Our samples are available at https://wangtianrui.github.io/wescon. 4 Experiments; 4.1 Experimental Setup; 4.2 Experimental Results; 4.2.1 Comparison with Reference Models; Objective Evaluation; Subjective Evaluation; Capability on Zero-shot TTS; 4.2.2 Ablation Study |
| Researcher Affiliation | Collaboration | Tianrui Wang (1,2,3), Haoyu Wang (1), Meng Ge (1), Cheng Gong (4), Chunyu Qiang (1,5), Ziyang Ma (3,6), Zikang Huang (1), Guanrou Yang (6), Xiaobao Wang (1,2), Eng Siong Chng (3), Xie Chen (6), Longbiao Wang (1,7), Jianwu Dang (8). Affiliations: (1) Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University; (2) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); (3) Nanyang Technological University; (4) TeleAI, China Telecom; (5) Kuaishou Technology; (6) Shanghai Jiao Tong University; (7) Huiyan Technology (Tianjin); (8) Shenzhen Institute of Advanced Technology |
| Pseudocode | No | The paper describes its methodology using descriptive text and illustrative figures (e.g., Figure 2, Figure 3, Figure 4) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Justification: The data we use can be accessed through the cited references. We include the code and data preparation scripts in the supplementary material, and we plan to open-source them in the near future. |
| Open Datasets | Yes | Data and Model Configuration: In the first stage, the content aligner is trained on 200 hours of non-emotional English-Chinese speech from LibriSpeech-100-Clean [45] and AISHELL-1 [46]. In the second stage, the teacher model uses non-transition emotional train-set from ESD [21] as prompts to synthesize training samples based on emotion-transition texts generated by GPT-4o (see Appendix D for generation details and examples). We use the train-set, dev-set, and test-set of ESD [21] for training and evaluation. We evaluate the performance of our method on the standard zero-shot TTS task using the SEED test set (test-zh) [60]. To evaluate the generalization ability of our approach under an out-of-domain dataset, we conduct word-level emotion and speaking rate control experiments on the CASIA dataset [68]. |
| Dataset Splits | Yes | Data and Model Configuration: In the first stage, the content aligner is trained on 200 hours of non-emotional English-Chinese speech from LibriSpeech-100-Clean [45] and AISHELL-1 [46]. In the second stage, the teacher model uses non-transition emotional train-set from ESD [21] as prompts to synthesize training samples based on emotion-transition texts generated by GPT-4o (see Appendix D for generation details and examples). We use the train-set, dev-set, and test-set of ESD [21] for training and evaluation. For evaluation, we generate 1,000 emotion-speed-varying text samples (500 in Chinese and 500 in English) using the script provided in Appendix D. Only the top 50% of data, ranked by this composite score, are selected for self-training. |
| Hardware Specification | Yes | Setup of Training and Inference: In the first stage, the content aligner is trained for 400k steps on 2 NVIDIA 3090 GPUs using Adam [48] with a learning rate linearly warmed up to 2.5e-4 over the first 10% of steps, then linearly decayed to 0. Each batch contains 90 seconds of speech. In the second stage, the student model is trained for 600k steps on 4 NVIDIA 3090 GPUs. |
| Software Dependencies | No | The paper mentions various software components and models like "CosyVoice2 [37]", "GPT-4o [41]", "Whisper-Large [53]", "Paraformer [54]", "WavLM-Large [55]", "emotion2vec-Large [57]", "wav2vec-based model [58]", "DNSMOS-Pro [59]", "SenseVoice model [61]", "Whisper model fine-tuned for speech emotion recognition", "Resemblyzer", "BigVGAN2 [62]", "Qwen2.5 [63]", "Diffusion Transformer (DiT) [64]", "flow matching [65]", "ConvNeXt V2 [66]", and "librosa [67]". However, it does not provide specific version numbers for these software packages or libraries, apart from the "o" in GPT-4o which is part of the model name. |
| Experiment Setup | Yes | Setup of Training and Inference: In the first stage, the content aligner is trained for 400k steps on 2 NVIDIA 3090 GPUs using Adam [48] with a learning rate linearly warmed up to 2.5e-4 over the first 10% of steps, then linearly decayed to 0. Each batch contains 90 seconds of speech. In the second stage, the student model is trained for 600k steps on 4 NVIDIA 3090 GPUs. The TTS model is frozen for the first 20k steps to focus on training the emotion aligner. Each batch contains 40 seconds of speech, and Adam is used with a fixed learning rate of 5e-7. Repetition-aware top-k sampling [49] is applied during inference, with k = 50 and temperature = 0.9. |
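
The first-stage learning-rate schedule and the inference sampling settings quoted in the Experiment Setup row can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the function names `aligner_lr` and `top_k_sample` are hypothetical, the warmup is assumed to start from 0 (the excerpt gives only the peak value), and the sampler is plain top-k with temperature rather than the repetition-aware variant of [49].

```python
import math
import random


def aligner_lr(step, total_steps=400_000, peak_lr=2.5e-4, warmup_frac=0.10):
    """Linear warmup to peak_lr over the first warmup_frac of steps,
    then linear decay to 0 at total_steps.

    Assumption: warmup starts at 0; the source states only the peak."""
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


def top_k_sample(logits, k=50, temperature=0.9, rng=random):
    """Plain top-k sampling with temperature.

    Assumption: this omits the repetition-aware logic of [49] and only
    shows where k = 50 and temperature = 0.9 enter the computation."""
    k = min(k, len(logits))
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax weights over the kept candidates
    # (subtracting the max keeps exp() numerically stable).
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    for step in (0, 20_000, 40_000, 220_000, 400_000):
        print(step, aligner_lr(step))
    toy_logits = [float(i) for i in range(100)]
    print(top_k_sample(toy_logits, rng=random.Random(0)))
```

With the defaults above, warmup ends at step 40,000, which is 10% of the 400k first-stage steps. The second stage uses a fixed learning rate of 5e-7, so it needs no schedule.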