Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations

Authors: Yejin Jeon, Yunsu Kim, Gary Geunbae Lee

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments and validations substantiate the efficacy of our proposed approach in preserving and harnessing speaker-specific attributes vis-à-vis alternative baseline models.
Researcher Affiliation | Collaboration | Yejin Jeon (1), Yunsu Kim (2), Gary Geunbae Lee (1,3); (1) Graduate School of Artificial Intelligence, POSTECH, Republic of Korea; (2) aiXplain Inc., Los Gatos, CA, USA; (3) Department of Computer Science and Engineering, POSTECH, Republic of Korea
Pseudocode | No | The paper describes the model architecture and components in text and figures, but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper references open-source third-party tools and models (e.g., 'EnCodec model from https://github.com/facebookresearch/encodec', 'pymcd library', 'OpenAI's pretrained Whisper large-v2 model', 'jiwer package') but does not state that the source code for their own proposed methodology is released or provide a link to it.
Open Datasets | Yes | The benchmark LibriTTS train-clean-100 dataset (Zen et al. 2019) is used to conduct training and validation.
Dataset Splits | No | The paper states 'The benchmark LibriTTS train-clean-100 dataset (Zen et al. 2019) is used to conduct training and validation,' but it does not provide specific percentages, sample counts, or explicit details of the validation split used.
Hardware Specification | Yes | All experiments are conducted on a single RTX A6000 GPU with a batch size of 16, until step 300,000.
Software Dependencies | No | The paper mentions software like 'Librosa library', 'G2P library', 'HiFi-GAN vocoder', 'Whisper large-v2 model', and 'jiwer package' but does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | Adam optimization is utilized with hyperparameters β1 = 0.9, β2 = 0.98, and ϵ = 10⁻⁹. To convert the generated mel-spectrograms into audio, we employ the HiFi-GAN (Kong, Kim, and Bae 2020) vocoder. All experiments are conducted on a single RTX A6000 GPU with a batch size of 16, until step 300,000.
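
The Open Datasets and Dataset Splits rows above name LibriTTS train-clean-100 but note that no split details or data pipeline are published. The sketch below shows one way to obtain that subset, assuming torchaudio's built-in LIBRITTS downloader; the root path and the 90/10 split are illustrative assumptions, not the authors' (unreported) setup.

```python
import torch
from torchaudio.datasets import LIBRITTS

# Download the train-clean-100 subset named in the paper.
# "./data" is an illustrative path, not taken from the paper.
dataset = LIBRITTS(root="./data", url="train-clean-100", download=True)

# The paper does not report its validation split; this 90/10 split is
# purely an assumption made for demonstration purposes.
n_val = len(dataset) // 10
train_set, val_set = torch.utils.data.random_split(
    dataset,
    [len(dataset) - n_val, n_val],
    generator=torch.Generator().manual_seed(0),
)

# Each item is (waveform, sample_rate, text, normalized_text,
#               speaker_id, chapter_id, utterance_id).
waveform, sample_rate, text, norm_text, speaker_id, chapter_id, utt_id = dataset[0]
print(sample_rate, speaker_id, norm_text)
```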
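The Experiment Setup and Hardware Specification rows fix only a few training knobs (Adam with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹; batch size 16; 300,000 steps on one RTX A6000). A minimal PyTorch sketch of that optimizer configuration follows; the stand-in network, learning rate, and loss are assumptions, since the authors' model code is not public.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in acoustic model: the authors' network is not released, so a tiny
# mel-to-mel module is used here only to make the optimizer sketch runnable.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80)).to(device)

# Adam hyperparameters as reported; the learning rate is NOT stated in the
# excerpt above, so 1e-4 is an assumed placeholder.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.98), eps=1e-9)

BATCH_SIZE = 16        # reported batch size (single RTX A6000)
TOTAL_STEPS = 300_000  # reported training length

for step in range(TOTAL_STEPS):
    # Random tensors stand in for real LibriTTS mel-spectrogram batches.
    mel = torch.randn(BATCH_SIZE, 80, device=device)
    loss = nn.functional.mse_loss(model(mel), mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```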
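Similarly, the Open Source Code and Software Dependencies rows name the evaluation tools (Whisper large-v2, jiwer) without versions or scripts. The sketch below shows one plausible intelligibility check built from those two packages; the audio path and reference transcript are hypothetical, and this is an assumed reconstruction rather than the authors' actual evaluation code.

```python
# Assumed installs: pip install openai-whisper jiwer
import whisper
from jiwer import wer

# Load OpenAI's pretrained Whisper large-v2 model, as named in the paper.
asr_model = whisper.load_model("large-v2")

# Hypothetical synthesized utterance and its ground-truth transcript.
synthesized_wav = "synthesized_sample.wav"
reference_text = "the negated speaker representation improves zero shot synthesis"

# Transcribe the generated audio and score intelligibility with word error rate.
hypothesis_text = asr_model.transcribe(synthesized_wav)["text"]
print("WER:", wer(reference_text.lower(), hypothesis_text.lower().strip()))
```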