Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
Authors: Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu
NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of different speakers, including those unseen during training. ... We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the multispeaker TTS task, and is able to synthesize natural speech from speakers unseen during training. ... Section 3 Experiments, Table 1: Speech naturalness Mean Opinion Score (MOS) with 95% confidence intervals. |
| Researcher Affiliation | Industry | Google Inc. {jiaye,ngyuzh,ronw}@google.com |
| Pseudocode | No | The paper describes the system architecture and components but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide explicit statements about the release of source code for the described methodology, nor does it include a link to a code repository. |
| Open Datasets | Yes | We used two public datasets for training the speech synthesis and vocoder networks. VCTK [21] contains 44 hours of clean speech from 109 speakers... LibriSpeech [12] consists of the union of the two clean training sets, comprising 436 hours of speech from 1,172 speakers... |
| Dataset Splits | Yes | We downsampled the audio to 24 kHz, trimmed leading and trailing silence (reducing the median duration from 3.3 seconds to 1.8 seconds), and split into three subsets: train, validation (containing the same speakers as the train set) and test (containing 11 speakers held out from the train and validation sets). |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like Tacotron 2 and WaveNet, but does not provide specific version numbers for software dependencies or libraries (e.g., 'Python 3.8', 'PyTorch 1.9'). |
| Experiment Setup | Yes | Input 40-channel log-mel spectrograms are passed to a network consisting of a stack of 3 LSTM layers of 768 cells, each followed by a projection to 256 dimensions. ... Target spectrogram features are computed from 50ms windows computed with a 12.5ms step, passed through an 80-channel mel-scale filterbank followed by log dynamic range compression. We extend [15] by augmenting the L2 loss on the predicted spectrogram with an additional L1 loss. |
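
To make the quoted speaker-encoder setup concrete, here is a minimal PyTorch sketch (not the authors' implementation): 40-channel log-mel input frames pass through a stack of 3 LSTM layers of 768 cells, each followed by a projection to 256 dimensions. Reading out the final frame and L2-normalizing it follows the GE2E speaker-verification recipe the paper builds on; that readout is an assumption here, not a detail stated in the quoted excerpt.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Sketch of the paper's speaker encoder: 3 x (LSTM-768 + projection-256)."""

    def __init__(self, n_mels: int = 40, hidden: int = 768, proj: int = 256):
        super().__init__()
        # proj_size adds the per-layer 768 -> 256 projection described in the paper.
        self.lstm = nn.LSTM(
            input_size=n_mels,
            hidden_size=hidden,
            num_layers=3,
            proj_size=proj,
            batch_first=True,
        )

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        """mels: (batch, n_frames, 40) -> (batch, 256) unit-norm speaker embeddings."""
        outputs, _ = self.lstm(mels)
        # Take the top layer's output at the last frame and L2-normalize it
        # (GE2E-style readout; an assumption, see lead-in).
        return F.normalize(outputs[:, -1], p=2, dim=1)
```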
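The synthesizer's target features can be sketched the same way: 80-channel mel spectrograms from 50 ms windows with a 12.5 ms step, followed by log dynamic-range compression, at the 24 kHz sample rate noted in the Dataset Splits row. The librosa calls and the clipping floor are illustrative assumptions; the paper does not specify an implementation.

```python
import numpy as np
import librosa

SAMPLE_RATE = 24_000                     # audio downsampled to 24 kHz
WIN_LENGTH = int(0.050 * SAMPLE_RATE)    # 50 ms window -> 1200 samples
HOP_LENGTH = int(0.0125 * SAMPLE_RATE)   # 12.5 ms step -> 300 samples
N_MELS = 80                              # 80-channel mel-scale filterbank

def log_mel_spectrogram(wav: np.ndarray) -> np.ndarray:
    """Return an (n_frames, 80) log-mel spectrogram for a mono waveform."""
    mel = librosa.feature.melspectrogram(
        y=wav,
        sr=SAMPLE_RATE,
        n_fft=WIN_LENGTH,
        win_length=WIN_LENGTH,
        hop_length=HOP_LENGTH,
        n_mels=N_MELS,
    )
    # Log dynamic-range compression; the 1e-5 floor is a common choice,
    # not a value taken from the paper.
    return np.log(np.maximum(mel, 1e-5)).T
```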
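Finally, the augmented reconstruction loss, the L2 loss on the predicted spectrogram plus an additional L1 term, reduces to a one-liner; weighting the two terms equally is an assumption, as the paper does not state their relative weights.

```python
import torch
import torch.nn.functional as F

def spectrogram_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # L2 (MSE) spectrogram loss augmented with an additional L1 term;
    # equal weighting is an assumption.
    return F.mse_loss(pred, target) + F.l1_loss(pred, target)
```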