Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
Authors: Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu
NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of different speakers, including those unseen during training. ... We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the multispeaker TTS task, and is able to synthesize natural speech from speakers unseen during training. ... Section 3 Experiments, Table 1: Speech naturalness Mean Opinion Score (MOS) with 95% confidence intervals. |
| Researcher Affiliation | Industry | Google Inc. {jiaye,ngyuzh,ronw}@google.com |
| Pseudocode | No | The paper describes the system architecture and components but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide explicit statements about the release of source code for the described methodology, nor does it include a link to a code repository. |
| Open Datasets | Yes | We used two public datasets for training the speech synthesis and vocoder networks. VCTK [21] contains 44 hours of clean speech from 109 speakers... LibriSpeech [12] consists of the union of the two clean training sets, comprising 436 hours of speech from 1,172 speakers... |
| Dataset Splits | Yes | We downsampled the audio to 24 kHz, trimmed leading and trailing silence (reducing the median duration from 3.3 seconds to 1.8 seconds), and split into three subsets: train, validation (containing the same speakers as the train set) and test (containing 11 speakers held out from the train and validation sets). |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like Tacotron 2 and WaveNet, but does not provide specific version numbers for software dependencies or libraries (e.g., 'Python 3.8', 'PyTorch 1.9'). |
| Experiment Setup | Yes | Input 40-channel log-mel spectrograms are passed to a network consisting of a stack of 3 LSTM layers of 768 cells, each followed by a projection to 256 dimensions. ... Target spectrogram features are computed from 50ms windows computed with a 12.5ms step, passed through an 80-channel mel-scale filterbank followed by log dynamic range compression. We extend [15] by augmenting the L2 loss on the predicted spectrogram with an additional L1 loss. |
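
To make the quoted speaker-encoder setup concrete, here is a minimal PyTorch sketch (not the authors' implementation): 40-channel log-mel input frames pass through a stack of 3 LSTM layers of 768 cells, each followed by a projection to 256 dimensions. Reading out the final frame and L2-normalizing it follows the GE2E speaker-verification recipe the paper builds on; that readout is an assumption here, not a detail stated in the quoted excerpt.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Sketch of the paper's speaker encoder: 3 x (LSTM-768 + projection-256)."""

    def __init__(self, n_mels: int = 40, hidden: int = 768, proj: int = 256):
        super().__init__()
        # proj_size adds the per-layer 768 -> 256 projection described in the paper.
        self.lstm = nn.LSTM(
            input_size=n_mels,
            hidden_size=hidden,
            num_layers=3,
            proj_size=proj,
            batch_first=True,
        )

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        """mels: (batch, n_frames, 40) -> (batch, 256) unit-norm speaker embeddings."""
        outputs, _ = self.lstm(mels)
        # Take the top layer's output at the last frame and L2-normalize it
        # (GE2E-style readout; an assumption, see lead-in).
        return F.normalize(outputs[:, -1], p=2, dim=1)
```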
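The synthesizer's target features can be sketched the same way: 80-channel mel spectrograms from 50 ms windows with a 12.5 ms step, followed by log dynamic-range compression, at the 24 kHz sample rate noted in the Dataset Splits row. The librosa calls and the clipping floor are illustrative assumptions; the paper does not specify an implementation.

```python
import numpy as np
import librosa

SAMPLE_RATE = 24_000                     # audio downsampled to 24 kHz
WIN_LENGTH = int(0.050 * SAMPLE_RATE)    # 50 ms window -> 1200 samples
HOP_LENGTH = int(0.0125 * SAMPLE_RATE)   # 12.5 ms step -> 300 samples
N_MELS = 80                              # 80-channel mel-scale filterbank

def log_mel_spectrogram(wav: np.ndarray) -> np.ndarray:
    """Return an (n_frames, 80) log-mel spectrogram for a mono waveform."""
    mel = librosa.feature.melspectrogram(
        y=wav,
        sr=SAMPLE_RATE,
        n_fft=WIN_LENGTH,
        win_length=WIN_LENGTH,
        hop_length=HOP_LENGTH,
        n_mels=N_MELS,
    )
    # Log dynamic-range compression; the 1e-5 floor is a common choice,
    # not a value taken from the paper.
    return np.log(np.maximum(mel, 1e-5)).T
```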
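Finally, the augmented reconstruction loss, the L2 loss on the predicted spectrogram plus an additional L1 term, reduces to a one-liner; weighting the two terms equally is an assumption, as the paper does not state their relative weights.

```python
import torch
import torch.nn.functional as F

def spectrogram_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # L2 (MSE) spectrogram loss augmented with an additional L1 term;
    # equal weighting is an assumption.
    return F.mse_loss(pred, target) + F.l1_loss(pred, target)
```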