Sample Efficient Adaptive Text-to-Speech

Authors: Yutian Chen, Yannis Assael, Brendan Shillingford, David Budden, Scott Reed, Heiga Zen, Quan Wang, Luis C. Cobo, Andrew Trask, Ben Laurie, Caglar Gulcehre, Aäron van den Oord, Oriol Vinyals, Nando de Freitas

ICLR 2019

Each reproducibility variable below is listed with its extracted result and the supporting LLM response.
Research Type: Experimental
LLM response: "We present a meta-learning approach for adaptive text-to-speech (TTS) with few data. ... The experiments show that these approaches are successful at adapting the multi-speaker neural network to new speakers, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers. ... In this section, we evaluate the quality of samples of SEA-ALL, SEA-EMB and SEA-ENC. We first measure the naturalness of the generated utterances using the standard Mean Opinion Score (MOS) procedure. Then, we evaluate the similarity of generated and real samples using the subjective MOS test and objectively using a speaker verification system (Wan et al., 2018). Finally, we study these results varying the size of the adaptation dataset."
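For context on the objective similarity test cited above, Wan et al. (2018) score a voice by the cosine similarity between speaker-verification embeddings (d-vectors) of generated and real speech. The sketch below shows that scoring step only; `speaker_encoder` is a hypothetical stand-in for a pretrained verification model, not code released with the paper.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (d-vectors)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_score(speaker_encoder, real_wavs, generated_wav) -> float:
    """Score a synthesized utterance against a speaker's real recordings.

    `speaker_encoder` is a hypothetical pretrained text-independent
    speaker-verification model (in the spirit of Wan et al., 2018) that
    maps a waveform to a fixed-dimensional d-vector.
    """
    # Enrollment centroid: mean d-vector of the speaker's real utterances.
    centroid = np.mean([speaker_encoder(w) for w in real_wavs], axis=0)
    # Verification-style score for the generated utterance.
    return cosine_similarity(centroid, speaker_encoder(generated_wav))
```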
Researcher Affiliation: Industry
LLM response: "DeepMind & Google, yutianc@google.com"
Pseudocode: No
LLM response: The paper includes mathematical equations and architectural diagrams but no pseudocode or algorithm blocks.
Open Source Code: No
LLM response: "Synthetic utterances are provided on our demo webpage https://sample-efficient-adaptive-tts.github.io/demo." (Explanation: this URL points to a demo page, not to the source code for the methodology.)
Open Datasets: Yes
LLM response: "We train a WaveNet model for each of our three methods using the same dataset, which combines the high-quality LibriSpeech audiobook corpus (Panayotov et al., 2015) and a proprietary speech corpus."
Dataset Splits: Yes
LLM response: "For every test speaker, we randomly split their demonstration utterances into an adaptation set for adapting our WaveNet models and a test set for evaluation. ... We therefore hold out 10% of our demonstration data for calculating a standard early termination criterion."
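As a concrete reading of the quoted split, the sketch below partitions one speaker's demonstration utterances at random into an adaptation set and a test set, then holds out 10% of the adaptation portion for the early-termination criterion. The adaptation fraction is an illustrative assumption (the paper varies the adaptation set size); this helper is not from the authors' code.

```python
import random

def split_demonstrations(utterances, adapt_fraction=0.5,
                         holdout_fraction=0.1, seed=0):
    """Split one test speaker's demonstration utterances.

    Returns (adaptation, holdout, test). `adapt_fraction` is an
    illustrative assumption; the 10% holdout matches the
    early-termination criterion quoted above.
    """
    rng = random.Random(seed)
    shuffled = list(utterances)
    rng.shuffle(shuffled)
    n_adapt = int(len(shuffled) * adapt_fraction)
    adaptation, test = shuffled[:n_adapt], shuffled[n_adapt:]
    # Hold out 10% of the adaptation data for early stopping.
    n_holdout = max(1, int(len(adaptation) * holdout_fraction))
    return adaptation[n_holdout:], adaptation[:n_holdout], test
```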
Hardware Specification: No
LLM response: The paper does not specify the hardware used for running its experiments.
Software Dependencies: No
LLM response: The paper does not provide specific software dependencies with version numbers needed to replicate the experiment.
Experiment Setup: Yes
LLM response: "The multi-speaker WaveNet model has the same architecture as van den Oord et al. (2016) except that we use a 200-dimensional speaker embedding space to model the large diversity of voices. ... it takes 5-10k optimization steps to fit the embedding vector, and an additional 100-200 steps to fine-tune the entire model using early stopping for SEA-ALL."
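To make the quoted two-stage recipe concrete: SEA-EMB fits only a new speaker embedding with the rest of the network frozen, and SEA-ALL then fine-tunes all weights for a short run with early stopping on the held-out 10%. The PyTorch-style sketch below is a minimal reading of that description, assuming a `model` exposing a `speaker_embedding` parameter and a callable `loss_fn`; these names, the learning rates, and `patience` are all illustrative, not the authors' implementation.

```python
import torch

def adapt_to_new_speaker(model, loss_fn, adapt_batches, holdout_batches,
                         embed_steps=10_000, finetune_steps=200, patience=10):
    """Two-stage few-shot adaptation sketch (SEA-EMB, then SEA-ALL)."""
    # Stage 1 (SEA-EMB): fit a fresh 200-d speaker embedding, network frozen.
    # `model.speaker_embedding` is an assumed torch.nn.Parameter.
    for p in model.parameters():
        p.requires_grad = False
    model.speaker_embedding.requires_grad = True
    opt = torch.optim.Adam([model.speaker_embedding], lr=1e-2)
    for step in range(embed_steps):
        opt.zero_grad()
        loss_fn(model, adapt_batches[step % len(adapt_batches)]).backward()
        opt.step()

    # Stage 2 (SEA-ALL): unfreeze everything and fine-tune briefly,
    # stopping early when the held-out loss stops improving.
    for p in model.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    best, since_best = float("inf"), 0
    for step in range(finetune_steps):
        opt.zero_grad()
        loss_fn(model, adapt_batches[step % len(adapt_batches)]).backward()
        opt.step()
        with torch.no_grad():
            holdout = sum(loss_fn(model, b).item() for b in holdout_batches)
        if holdout < best:
            best, since_best = holdout, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return model
```

Early stopping on the small holdout is what keeps the 100-200-step fine-tuning stage from overfitting to the few minutes of adaptation audio.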