Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis

Authors: Wenhao Guan, Yishuang Li, Tao Li, Hukai Huang, Feng Wang, Jiayan Lin, Lingyan Huang, Lin Li, Qingyang Hong

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on the MEAD-TTS dataset and out-of-domain datasets demonstrate that MM-TTS can achieve satisfactory results based on multimodal prompts.
Researcher Affiliation | Academia | 1 School of Informatics, Xiamen University, China; 2 Institute of Artificial Intelligence, Xiamen University, China; 3 School of Electronic Science and Engineering, Xiamen University, China
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper states: "The audio samples and constructed dataset are available at https://multimodal-tts.github.io." This link points to a project website and explicitly mentions audio samples and a dataset, but not the source code for the methodology. It does not meet the requirement of providing a direct link to a code repository or explicitly stating code availability.
Open Datasets | Yes | Since there is no public dataset for multi-modal TTS, we construct a dataset named MEAD-TTS, which is related to the field of expressive talking head. Our experiments on the MEAD-TTS dataset and out-of-domain datasets demonstrate that MM-TTS can achieve satisfactory results based on multimodal prompts. The audio samples and constructed dataset are available at https://multimodal-tts.github.io. Datasets: We utilize the MEAD-TTS dataset for intra-domain evaluation. And the LibriTTS (Zen et al. 2019) transcriptions and speech clips are utilized for out-of-domain evaluation. For face based style transfer, we use the face images in Oulu-CASIA (Zhao et al. 2011) dataset and transcriptions in LibriTTS for out-of-domain evaluation. For text description based style transfer, we use LibriTTS transcriptions for out-of-domain evaluations.
Dataset Splits | No | The paper mentions using LJSpeech for pretraining and MEAD-TTS, LibriTTS, and Oulu-CASIA datasets for evaluation. It states "we select 100 samples from MEAD-TTS and LibriTTS testing sets for intra-domain and out-of-domain evaluation respectively." This indicates a test set, but it does not provide specific percentages or counts for training, validation, and test splits for the constructed MEAD-TTS dataset, nor does it specify how splits were managed for the other datasets to ensure reproducibility of the partitioning.
Hardware Specification | Yes | The proposed MM-TTS was trained for 200K iterations using Adam optimizer (Kingma and Ba 2014) on a single NVIDIA GeForce RTX 2080Ti GPU for both the first text-to-mel stage and second refiner stage training pipeline.
Software Dependencies | No | The paper mentions using "Adam optimizer," a "pretrained HiFi-GAN" as vocoder, and the "whisper tool." However, it does not provide specific version numbers for any of these software components or libraries.
Experiment Setup | Yes | The proposed MM-TTS was trained for 200K iterations using Adam optimizer (Kingma and Ba 2014) on a single NVIDIA GeForce RTX 2080Ti GPU for both the first text-to-mel stage and second refiner stage training pipeline. Additionally, we utilize a pretrained HiFi-GAN (Kong, Kim, and Bae 2020) as the neural vocoder to convert generated Mel-spectrogram to waveform. In practice, the speech clips are resampled to 16 kHz.
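
The setup details quoted in the table can be collected into a small record, which makes it easier to see what a reproduction attempt would still need to look up. The following is a minimal sketch, not the authors' configuration: the class name, the field names, and the helper function are hypothetical, and a value of None only means the detail is not quoted in this report (the paper itself may state it).

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class ReportedSetup:
    """Training details quoted in the table above (hypothetical structure)."""

    optimizer: str = "Adam"
    iterations: int = 200_000        # "trained for 200K iterations"
    gpu: str = "NVIDIA GeForce RTX 2080Ti"
    gpu_count: int = 1               # "a single ... GPU"
    vocoder: str = "HiFi-GAN (pretrained)"
    sample_rate_hz: int = 16_000     # speech clips resampled to 16 kHz

    # Assumed placeholders: these details are not quoted in this report,
    # so they are left unset rather than guessed.
    learning_rate: Optional[float] = None
    batch_size: Optional[int] = None
    random_seed: Optional[int] = None
    framework_version: Optional[str] = None


def unreported(setup: ReportedSetup) -> List[str]:
    """Return the names of fields that are still unset (None)."""
    return [f.name for f in fields(setup) if getattr(setup, f.name) is None]


setup = ReportedSetup()
print("Still to be looked up:", ", ".join(unreported(setup)))
```

Keeping unknown values as None rather than filling in typical defaults keeps the record honest: anything listed by unreported() has to come from the paper or its authors, not from this summary.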