Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis
Authors: Wenhao Guan, Yishuang Li, Tao Li, Hukai Huang, Feng Wang, Jiayan Lin, Lingyan Huang, Lin Li, Qingyang Hong
AAAI 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on the MEAD-TTS dataset and out-of-domain datasets demonstrate that MM-TTS can achieve satisfactory results based on multimodal prompts. |
| Researcher Affiliation | Academia | 1 School of Informatics, Xiamen University, China; 2 Institute of Artificial Intelligence, Xiamen University, China; 3 School of Electronic Science and Engineering, Xiamen University, China |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "The audio samples and constructed dataset are available at https://multimodal-tts.github.io." This link points to a project website and explicitly mentions audio samples and a dataset, but not the source code for the methodology. It does not meet the requirement of providing a direct link to a code repository or explicitly stating code availability. |
| Open Datasets | Yes | Since there is no public dataset for multi-modal TTS, we construct a dataset named MEAD-TTS, which is related to the field of expressive talking head. Our experiments on the MEAD-TTS dataset and out-of-domain datasets demonstrate that MM-TTS can achieve satisfactory results based on multimodal prompts. The audio samples and constructed dataset are available at https://multimodal-tts.github.io. Datasets: We utilize the MEAD-TTS dataset for intra-domain evaluation. And the LibriTTS (Zen et al. 2019) transcriptions and speech clips are utilized for out-of-domain evaluation. For face based style transfer, we use the face images in Oulu-CASIA (Zhao et al. 2011) dataset and transcriptions in LibriTTS for out-of-domain evaluation. For text description based style transfer, we use LibriTTS transcriptions for out-of-domain evaluations. |
| Dataset Splits | No | The paper mentions using LJSpeech for pretraining and MEAD-TTS, LibriTTS, and Oulu-CASIA datasets for evaluation. It states "we select 100 samples from MEAD-TTS and LibriTTS testing sets for intra-domain and out-of-domain evaluation respectively." This indicates a test set, but it does not provide specific percentages or counts for training, validation, and test splits for the constructed MEAD-TTS dataset, nor does it specify how splits were managed for the other datasets to ensure reproducibility of the partitioning. |
| Hardware Specification | Yes | The proposed MM-TTS was trained for 200K iterations using Adam optimizer (Kingma and Ba 2014) on a single NVIDIA GeForce RTX 2080Ti GPU for both the first text-to-mel stage and second refiner stage training pipeline. |
| Software Dependencies | No | The paper mentions using "Adam optimizer," a "pretrained HiFi-GAN" as vocoder, and the "whisper tool." However, it does not provide specific version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | The proposed MM-TTS was trained for 200K iterations using Adam optimizer (Kingma and Ba 2014) on a single NVIDIA GeForce RTX 2080Ti GPU for both the first text-to-mel stage and second refiner stage training pipeline. Additionally, we utilize a pretrained HiFi-GAN (Kong, Kim, and Bae 2020) as the neural vocoder to convert generated Mel-spectrogram to waveform. In practice, the speech clips are resampled to 16 kHz. |
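
For readers attempting a re-run, the setup quoted in the Experiment Setup row can be collected into a small record. The following is only a sketch: the field names are hypothetical, and `None` marks settings that do not appear in the excerpt quoted above (the paper itself may state them elsewhere).

```python
# Settings taken from the Experiment Setup excerpt quoted in the table.
# Field names are illustrative, not from the paper. None marks values
# that are absent from the quoted excerpt (an assumption about the excerpt,
# not a claim about the full paper).
STATED_SETUP = {
    "iterations": 200_000,             # "trained for 200K iterations"
    "optimizer": "Adam",               # Kingma and Ba 2014
    "gpu": "NVIDIA GeForce RTX 2080Ti",
    "num_gpus": 1,                     # "a single ... GPU"
    "vocoder": "HiFi-GAN (pretrained)",
    "sample_rate_hz": 16_000,          # clips resampled to 16 kHz
    "learning_rate": None,
    "batch_size": None,
    "random_seed": None,
}


def missing_fields(setup):
    """Return the names of settings with no reported value, sorted."""
    return sorted(name for name, value in setup.items() if value is None)


if __name__ == "__main__":
    print("Not in the quoted excerpt:", missing_fields(STATED_SETUP))
```

Listing the unfilled fields this way makes it easy to see which values would still need to be looked up in the paper before a faithful reproduction.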