MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis

Authors: Wenhao Guan, Yishuang Li, Tao Li, Hukai Huang, Feng Wang, Jiayan Lin, Lingyan Huang, Lin Li, Qingyang Hong

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on the MEAD-TTS dataset and out-of-domain datasets demonstrate that MM-TTS can achieve satisfactory results based on multimodal prompts.
Researcher Affiliation | Academia | 1 School of Informatics, Xiamen University, China; 2 Institute of Artificial Intelligence, Xiamen University, China; 3 School of Electronic Science and Engineering, Xiamen University, China. whguan@stu.xmu.edu.cn, {lilin,qyhong}@xmu.edu.cn
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper states: "The audio samples and constructed dataset are available at https://multimodal-tts.github.io." This link points to a project website and explicitly mentions audio samples and a dataset, but not the source code for the methodology. It does not meet the requirement of providing a direct link to a code repository or explicitly stating code availability.
Open Datasets | Yes | Since there is no public dataset for multi-modal TTS, we construct a dataset named MEAD-TTS, which is related to the field of expressive talking heads. Our experiments on the MEAD-TTS dataset and out-of-domain datasets demonstrate that MM-TTS can achieve satisfactory results based on multimodal prompts. The audio samples and constructed dataset are available at https://multimodal-tts.github.io. Datasets: We use the MEAD-TTS dataset for intra-domain evaluation, and the LibriTTS (Zen et al. 2019) transcriptions and speech clips for out-of-domain evaluation. For face-based style transfer, we use the face images in the Oulu-CASIA (Zhao et al. 2011) dataset and transcriptions in LibriTTS for out-of-domain evaluation. For text-description-based style transfer, we use LibriTTS transcriptions for out-of-domain evaluation.
Dataset Splits | No | The paper mentions using LJSpeech for pretraining and MEAD-TTS, LibriTTS, and Oulu-CASIA for evaluation. It states "we select 100 samples from MEAD-TTS and Libri TTS testing sets for intra-domain and out-of-domain evaluation respectively." This indicates a test set, but the paper does not provide specific percentages or counts for training, validation, and test splits of the constructed MEAD-TTS dataset, nor does it specify how splits were managed for the other datasets to ensure the partitioning is reproducible (see the sample-selection sketch after the table).
Hardware Specification | Yes | The proposed MM-TTS was trained for 200K iterations using the Adam optimizer (Kingma and Ba 2014) on a single NVIDIA GeForce RTX 2080Ti GPU for both the first text-to-mel stage and the second refiner stage of the training pipeline.
Software Dependencies | No | The paper mentions using the "Adam optimizer," a "pretrained HiFi-GAN" as the vocoder, and the "whisper tool." However, it does not provide specific version numbers for any of these software components or libraries.
Experiment Setup | Yes | The proposed MM-TTS was trained for 200K iterations using the Adam optimizer (Kingma and Ba 2014) on a single NVIDIA GeForce RTX 2080Ti GPU for both the first text-to-mel stage and the second refiner stage of the training pipeline. Additionally, we utilize a pretrained HiFi-GAN (Kong, Kim, and Bae 2020) as the neural vocoder to convert the generated Mel-spectrogram to a waveform. In practice, the speech clips are resampled to 16 kHz (see the training-loop and resampling sketches after the table).
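
The reported training setup (Adam optimizer, 200K iterations, single GPU per stage) maps onto a standard single-GPU training skeleton. The sketch below is a minimal, hedged illustration in PyTorch: the model, data loader, loss computation, and learning rate are placeholders or assumptions, since the quoted setup does not specify them.

```python
import torch

def train_stage(model, data_loader, num_iters=200_000, lr=1e-4, device="cuda"):
    """Generic single-GPU training skeleton matching the reported setup:
    Adam optimizer, 200K iterations. The learning rate is an assumed value,
    not taken from the paper."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    data_iter = iter(data_loader)
    for step in range(num_iters):
        try:
            batch = next(data_iter)
        except StopIteration:
            # Restart the loader when an epoch ends; training is iteration-based.
            data_iter = iter(data_loader)
            batch = next(data_iter)
        optimizer.zero_grad()
        loss = model.compute_loss(batch)  # placeholder: model-specific loss
        loss.backward()
        optimizer.step()
    return model
```

Per the reported setup, the same skeleton would be run twice: once for the text-to-mel stage and once for the refiner stage.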
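The 16 kHz resampling step can also be made concrete. The paper does not name a resampling library, so torchaudio here is an assumption; the HiFi-GAN call at the end is only a placeholder for whichever pretrained generator checkpoint is used.

```python
import torchaudio

def resample_to_16k(wav_path):
    """Load a speech clip and resample it to 16 kHz, as described in the setup."""
    waveform, sr = torchaudio.load(wav_path)
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
    return waveform

# Mel-to-waveform conversion with a pretrained HiFi-GAN would then look roughly like:
# audio = hifigan(mel)  # hifigan: placeholder for a loaded HiFi-GAN generator
```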
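Finally, the evaluation draws 100 samples from the MEAD-TTS and LibriTTS test sets, but the paper does not state how they are selected. Below is a minimal sketch of one way to make such a selection reproducible, assuming plain lists of test-set utterance IDs; the seed and function name are illustrative, not from the paper.

```python
import random

def select_eval_subset(test_items, n=100, seed=0):
    """Draw a fixed-size evaluation subset reproducibly.

    test_items: list of test-set utterance IDs or file paths (assumed input).
    A fixed seed makes the selection repeatable across runs.
    """
    rng = random.Random(seed)
    return rng.sample(test_items, n)

# Hypothetical usage: 100 intra-domain and 100 out-of-domain samples.
# mead_tts_eval = select_eval_subset(mead_tts_test_items, n=100, seed=0)
# libritts_eval = select_eval_subset(libritts_test_items, n=100, seed=0)
```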