Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations

Authors: Yejin Jeon, Yunsu Kim, Gary Geunbae Lee

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments and validations substantiate the efficacy of our proposed approach in preserving and harnessing speaker-specific attributes vis-à-vis alternative baseline models.
Researcher Affiliation | Collaboration | Yejin Jeon (1), Yunsu Kim (2), Gary Geunbae Lee (1,3); (1) Graduate School of Artificial Intelligence, POSTECH, Republic of Korea; (2) aiXplain Inc., Los Gatos, CA, USA; (3) Department of Computer Science and Engineering, POSTECH, Republic of Korea
Pseudocode | No | The paper describes the model architecture and components in text and figures, but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper references open-source third-party tools and models (e.g., 'EnCodec model from https://github.com/facebookresearch/encodec', 'pymcd library', 'OpenAI's pretrained Whisper large-v2 model', 'jiwer package') but does not state that the source code for their own proposed methodology is released or provide a link to it.
Open Datasets | Yes | The benchmark LibriTTS train-clean-100 dataset (Zen et al. 2019) is used to conduct training and validation.
Dataset Splits | No | The paper states 'The benchmark LibriTTS train-clean-100 dataset (Zen et al. 2019) is used to conduct training and validation,' but it does not provide specific percentages, sample counts, or explicit details of the validation split used.
Hardware Specification | Yes | All experiments are conducted on a single RTX A6000 GPU with a batch size of 16, until step 300,000.
Software Dependencies | No | The paper mentions software like 'Librosa library', 'G2P library', 'HiFi-GAN vocoder', 'Whisper large-v2 model', and 'jiwer package' but does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | Adam optimization is utilized with hyperparameters β1 = 0.9, β2 = 0.98, and ϵ = 10⁻⁹. To convert the generated mel-spectrograms into audio, we employ the HiFi-GAN (Kong, Kim, and Bae 2020) vocoder. All experiments are conducted on a single RTX A6000 GPU with a batch size of 16, until step 300,000.
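
The Open Datasets and Dataset Splits rows above name LibriTTS train-clean-100 but note that no split details or data pipeline are published. The sketch below shows one way to obtain that subset, assuming torchaudio's built-in LIBRITTS downloader; the root path and the 90/10 split are illustrative assumptions, not the authors' (unreported) setup.

```python
import torch
from torchaudio.datasets import LIBRITTS

# Download the train-clean-100 subset named in the paper.
# "./data" is an illustrative path, not taken from the paper.
dataset = LIBRITTS(root="./data", url="train-clean-100", download=True)

# The paper does not report its validation split; this 90/10 split is
# purely an assumption made for demonstration purposes.
n_val = len(dataset) // 10
train_set, val_set = torch.utils.data.random_split(
    dataset,
    [len(dataset) - n_val, n_val],
    generator=torch.Generator().manual_seed(0),
)

# Each item is (waveform, sample_rate, text, normalized_text,
#               speaker_id, chapter_id, utterance_id).
waveform, sample_rate, text, norm_text, speaker_id, chapter_id, utt_id = dataset[0]
print(sample_rate, speaker_id, norm_text)
```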
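The Experiment Setup and Hardware Specification rows fix only a few training knobs (Adam with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹; batch size 16; 300,000 steps on one RTX A6000). A minimal PyTorch sketch of that optimizer configuration follows; the stand-in network, learning rate, and loss are assumptions, since the authors' model code is not public.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in acoustic model: the authors' network is not released, so a tiny
# mel-to-mel module is used here only to make the optimizer sketch runnable.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80)).to(device)

# Adam hyperparameters as reported; the learning rate is NOT stated in the
# excerpt above, so 1e-4 is an assumed placeholder.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.98), eps=1e-9)

BATCH_SIZE = 16        # reported batch size (single RTX A6000)
TOTAL_STEPS = 300_000  # reported training length

for step in range(TOTAL_STEPS):
    # Random tensors stand in for real LibriTTS mel-spectrogram batches.
    mel = torch.randn(BATCH_SIZE, 80, device=device)
    loss = nn.functional.mse_loss(model(mel), mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```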
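Similarly, the Open Source Code and Software Dependencies rows name the evaluation tools (Whisper large-v2, jiwer) without versions or scripts. The sketch below shows one plausible intelligibility check built from those two packages; the audio path and reference transcript are hypothetical, and this is an assumed reconstruction rather than the authors' actual evaluation code.

```python
# Assumed installs: pip install openai-whisper jiwer
import whisper
from jiwer import wer

# Load OpenAI's pretrained Whisper large-v2 model, as named in the paper.
asr_model = whisper.load_model("large-v2")

# Hypothetical synthesized utterance and its ground-truth transcript.
synthesized_wav = "synthesized_sample.wav"
reference_text = "the negated speaker representation improves zero shot synthesis"

# Transcribe the generated audio and score intelligibility with word error rate.
hypothesis_text = asr_model.transcribe(synthesized_wav)["text"]
print("WER:", wer(reference_text.lower(), hypothesis_text.lower().strip()))
```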