Sample Efficient Adaptive Text-to-Speech

Authors: Yutian Chen, Yannis Assael, Brendan Shillingford, David Budden, Scott Reed, Heiga Zen, Quan Wang, Luis C. Cobo, Andrew Trask, Ben Laurie, Caglar Gulcehre, Aäron van den Oord, Oriol Vinyals, Nando de Freitas

ICLR 2019

Each reproducibility variable below is listed with its extracted result and the supporting LLM response.
Research Type: Experimental
LLM response: "We present a meta-learning approach for adaptive text-to-speech (TTS) with few data. ... The experiments show that these approaches are successful at adapting the multi-speaker neural network to new speakers, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers. ... In this section, we evaluate the quality of samples of SEA-ALL, SEA-EMB and SEA-ENC. We first measure the naturalness of the generated utterances using the standard Mean Opinion Score (MOS) procedure. Then, we evaluate the similarity of generated and real samples using the subjective MOS test and objectively using a speaker verification system (Wan et al., 2018). Finally, we study these results varying the size of the adaptation dataset."
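For context on the objective similarity test cited above, Wan et al. (2018) score a voice by the cosine similarity between speaker-verification embeddings (d-vectors) of generated and real speech. The sketch below shows that scoring step only; `speaker_encoder` is a hypothetical stand-in for a pretrained verification model, not code released with the paper.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (d-vectors)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_score(speaker_encoder, real_wavs, generated_wav) -> float:
    """Score a synthesized utterance against a speaker's real recordings.

    `speaker_encoder` is a hypothetical pretrained text-independent
    speaker-verification model (in the spirit of Wan et al., 2018) that
    maps a waveform to a fixed-dimensional d-vector.
    """
    # Enrollment centroid: mean d-vector of the speaker's real utterances.
    centroid = np.mean([speaker_encoder(w) for w in real_wavs], axis=0)
    # Verification-style score for the generated utterance.
    return cosine_similarity(centroid, speaker_encoder(generated_wav))
```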
Researcher Affiliation: Industry
LLM response: "DeepMind & Google, yutianc@google.com"
Pseudocode: No
LLM response: The paper includes mathematical equations and architectural diagrams but no pseudocode or algorithm blocks.
Open Source Code: No
LLM response: "Synthetic utterances are provided on our demo webpage https://sample-efficient-adaptive-tts.github.io/demo." (Explanation: this URL points to a demo page, not to the source code for the methodology.)
Open Datasets: Yes
LLM response: "We train a WaveNet model for each of our three methods using the same dataset, which combines the high-quality LibriSpeech audiobook corpus (Panayotov et al., 2015) and a proprietary speech corpus."
Dataset Splits: Yes
LLM response: "For every test speaker, we randomly split their demonstration utterances into an adaptation set for adapting our WaveNet models and a test set for evaluation. ... We therefore hold out 10% of our demonstration data for calculating a standard early termination criterion."
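As a concrete reading of the quoted split, the sketch below partitions one speaker's demonstration utterances at random into an adaptation set and a test set, then holds out 10% of the adaptation portion for the early-termination criterion. The adaptation fraction is an illustrative assumption (the paper varies the adaptation set size); this helper is not from the authors' code.

```python
import random

def split_demonstrations(utterances, adapt_fraction=0.5,
                         holdout_fraction=0.1, seed=0):
    """Split one test speaker's demonstration utterances.

    Returns (adaptation, holdout, test). `adapt_fraction` is an
    illustrative assumption; the 10% holdout matches the
    early-termination criterion quoted above.
    """
    rng = random.Random(seed)
    shuffled = list(utterances)
    rng.shuffle(shuffled)
    n_adapt = int(len(shuffled) * adapt_fraction)
    adaptation, test = shuffled[:n_adapt], shuffled[n_adapt:]
    # Hold out 10% of the adaptation data for early stopping.
    n_holdout = max(1, int(len(adaptation) * holdout_fraction))
    return adaptation[n_holdout:], adaptation[:n_holdout], test
```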
Hardware Specification: No
LLM response: The paper does not specify the hardware used for running its experiments.
Software Dependencies: No
LLM response: The paper does not provide specific software dependencies with version numbers needed to replicate the experiment.
Experiment Setup: Yes
LLM response: "The multi-speaker WaveNet model has the same architecture as van den Oord et al. (2016) except that we use a 200-dimensional speaker embedding space to model the large diversity of voices. ... it takes 5-10k optimization steps to fit the embedding vector, and an additional 100-200 steps to fine-tune the entire model using early stopping for SEA-ALL."
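To make the quoted two-stage recipe concrete: SEA-EMB fits only a new speaker embedding with the rest of the network frozen, and SEA-ALL then fine-tunes all weights for a short run with early stopping on the held-out 10%. The PyTorch-style sketch below is a minimal reading of that description, assuming a `model` exposing a `speaker_embedding` parameter and a callable `loss_fn`; these names, the learning rates, and `patience` are all illustrative, not the authors' implementation.

```python
import torch

def adapt_to_new_speaker(model, loss_fn, adapt_batches, holdout_batches,
                         embed_steps=10_000, finetune_steps=200, patience=10):
    """Two-stage few-shot adaptation sketch (SEA-EMB, then SEA-ALL)."""
    # Stage 1 (SEA-EMB): fit a fresh 200-d speaker embedding, network frozen.
    # `model.speaker_embedding` is an assumed torch.nn.Parameter.
    for p in model.parameters():
        p.requires_grad = False
    model.speaker_embedding.requires_grad = True
    opt = torch.optim.Adam([model.speaker_embedding], lr=1e-2)
    for step in range(embed_steps):
        opt.zero_grad()
        loss_fn(model, adapt_batches[step % len(adapt_batches)]).backward()
        opt.step()

    # Stage 2 (SEA-ALL): unfreeze everything and fine-tune briefly,
    # stopping early when the held-out loss stops improving.
    for p in model.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    best, since_best = float("inf"), 0
    for step in range(finetune_steps):
        opt.zero_grad()
        loss_fn(model, adapt_batches[step % len(adapt_batches)]).backward()
        opt.step()
        with torch.no_grad():
            holdout = sum(loss_fn(model, b).item() for b in holdout_batches)
        if holdout < best:
            best, since_best = holdout, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return model
```

Early stopping on the small holdout is what keeps the 100-200-step fine-tuning stage from overfitting to the few minutes of adaptation audio.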