MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis
Authors: Wenhao Guan, Yishuang Li, Tao Li, Hukai Huang, Feng Wang, Jiayan Lin, Lingyan Huang, Lin Li, Qingyang Hong
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on the MEAD-TTS dataset and out-of-domain datasets demonstrate that MM-TTS can achieve satisfactory results based on multimodal prompts. |
| Researcher Affiliation | Academia | 1 School of Informatics, Xiamen University, China; 2 Institute of Artificial Intelligence, Xiamen University, China; 3 School of Electronic Science and Engineering, Xiamen University, China. whguan@stu.xmu.edu.cn, {lilin,qyhong}@xmu.edu.cn |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "The audio samples and constructed dataset are available at https://multimodal-tts.github.io." This link points to a project website and explicitly mentions audio samples and a dataset, but not the source code for the methodology. It does not meet the requirement of providing a direct link to a code repository or explicitly stating code availability. |
| Open Datasets | Yes | Since there is no public dataset for multi-modal TTS, we construct a dataset named MEAD-TTS, which is related to the field of expressive talking head. Our experiments on the MEAD-TTS dataset and out-of-domain datasets demonstrate that MM-TTS can achieve satisfactory results based on multimodal prompts. The audio samples and constructed dataset are available at https://multimodal-tts.github.io. Datasets: We utilize the MEAD-TTS dataset for intra-domain evaluation, and the LibriTTS (Zen et al. 2019) transcriptions and speech clips for out-of-domain evaluation. For face-based style transfer, we use the face images in the Oulu-CASIA (Zhao et al. 2011) dataset and transcriptions in LibriTTS for out-of-domain evaluation. For text-description-based style transfer, we use LibriTTS transcriptions for out-of-domain evaluation. |
| Dataset Splits | No | The paper mentions using LJSpeech for pretraining and the MEAD-TTS, LibriTTS, and Oulu-CASIA datasets for evaluation. It states "we select 100 samples from MEAD-TTS and LibriTTS testing sets for intra-domain and out-of-domain evaluation respectively." This indicates a test set, but it does not provide specific percentages or counts for training, validation, and test splits of the constructed MEAD-TTS dataset, nor does it specify how the partitions of the other datasets were drawn, so the splitting cannot be reproduced exactly (an illustrative seeded-selection sketch is given after the table). |
| Hardware Specification | Yes | The proposed MM-TTS was trained for 200K iterations using Adam optimizer (Kingma and Ba 2014) on a single NVIDIA GeForce RTX 2080Ti GPU for both the first text-to-mel stage and second refiner stage training pipeline. |
| Software Dependencies | No | The paper mentions using the "Adam optimizer," a "pretrained HiFi-GAN" as the vocoder, and the "whisper tool." However, it does not provide specific version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | The proposed MM-TTS was trained for 200K iterations using Adam optimizer (Kingma and Ba 2014) on a single NVIDIA GeForce RTX 2080Ti GPU for both the first text-to-mel stage and second refiner stage training pipeline. Additionally, we utilize a pretrained HiFi-GAN (Kong, Kim, and Bae 2020) as the neural vocoder to convert the generated Mel-spectrogram to waveform. In practice, the speech clips are resampled to 16 kHz. (An illustrative sketch of this setup follows the table.) |
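
For readers trying to approximate the reported experiment setup, the sketch below shows what the described training configuration could look like in PyTorch. It is a minimal illustration under assumptions, not the authors' implementation: the model, data loader, loss, learning rate, and function names (`resample_to_16k`, `train_stage`) are placeholders, since the paper only specifies the Adam optimizer, 200K iterations per stage, a single GeForce RTX 2080Ti, and 16 kHz speech.

```python
import torch
import torchaudio

# Hedged sketch of the reported setup: Adam optimizer, 200K iterations per
# stage, single-GPU training, 16 kHz speech. Model and data loader are
# placeholders, not the authors' code.

SAMPLE_RATE = 16_000   # paper: speech clips resampled to 16 kHz
TOTAL_ITERS = 200_000  # paper: 200K iterations for each training stage


def resample_to_16k(waveform: torch.Tensor, orig_sr: int) -> torch.Tensor:
    """Resample a speech clip to the 16 kHz rate used in the paper."""
    if orig_sr == SAMPLE_RATE:
        return waveform
    return torchaudio.functional.resample(waveform, orig_sr, SAMPLE_RATE)


def train_stage(model: torch.nn.Module, data_loader, device: str = "cuda") -> None:
    """Generic single-stage loop (text-to-mel stage or refiner stage).

    Assumes the model's forward pass returns a scalar loss; the learning
    rate is left at Adam's default because the paper does not report it.
    """
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters())
    step = 0
    while step < TOTAL_ITERS:
        for batch in data_loader:
            loss = model(*[x.to(device) for x in batch])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= TOTAL_ITERS:
                return
```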
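
The dataset-split concern noted above could be addressed by publishing a deterministic selection procedure. The snippet below is a hypothetical example of how the "100 samples from MEAD-TTS and LibriTTS testing sets" might be drawn reproducibly; the function name, seed values, and ID variables are illustrative and do not come from the paper.

```python
import random


def select_eval_samples(test_utterance_ids, n: int = 100, seed: int = 0):
    """Deterministically draw n evaluation utterances from a test split.

    The paper reports selecting 100 samples from the MEAD-TTS and LibriTTS
    testing sets but does not document the sampling procedure; sorting the
    IDs and fixing the seed here are purely illustrative ways to make the
    selection reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(sorted(test_utterance_ids), n)


# Hypothetical usage with placeholder ID lists:
# mead_eval = select_eval_samples(mead_tts_test_ids)
# libri_eval = select_eval_samples(libritts_test_ids, seed=1)
```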