End-to-end Adversarial Text-to-Speech
Authors: Jeff Donahue, Sander Dieleman, Mikołaj Bińkowski, Erich Elsen, Karen Simonyan
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we discuss the setup and results of our empirical evaluation, describing the hyperparameter settings used for training and validating the architectural decisions and loss function components detailed in Section 2. Our primary metric used to evaluate speech quality is the Mean Opinion Score (MOS) given by human raters, computed by taking the mean of 1-5 naturalness ratings given across 1000 held-out conditioning sequences. |
| Researcher Affiliation | Industry | Jeff Donahue, Sander Dieleman, Mikołaj Bińkowski, Erich Elsen, Karen Simonyan, DeepMind {jeffdonahue,sedielem,binek,eriche,simonyan}@google.com |
| Pseudocode | Yes | B ALIGNER PSEUDOCODE: In Figure 3 we present pseudocode for the EATS aligner described in Section 2.1. |
| Open Source Code | No | Models were implemented using the TensorFlow (Abadi et al., 2015) v1 framework and the Sonnet (Reynolds et al., 2017) neural network library. We used the TF-Replicator (Buchlovsky et al., 2019) library for data parallel training over TPUs. Samples from each ablation are available at https://deepmind.com/research/publications/End-to-End-Adversarial-Text-to-Speech. |
| Open Datasets | No | We train all models on a private dataset that consists of high-quality recordings of human speech performed by professional voice actors, and corresponding text. |
| Dataset Splits | Yes | Our primary metric used to evaluate speech quality is the Mean Opinion Score (MOS) given by human raters, computed by taking the mean of 1-5 naturalness ratings given across 1000 held-out conditioning sequences. FDSD scores presented here were computed on held-out validation multi-speaker set... we used 5,120 samples for FDSD and 1,000 for MOS. |
| Hardware Specification | Yes | In Table 3 we report benchmarks for EATS batched inference on modern hardware platforms (Google Cloud TPU v3, NVIDIA V100 GPU, Intel Xeon E5-1650 CPU). |
| Software Dependencies | Yes | Models were implemented using the TensorFlow (Abadi et al., 2015) v1 framework and the Sonnet (Reynolds et al., 2017) neural network library. We used the TF-Replicator (Buchlovsky et al., 2019) library for data parallel training over TPUs. |
| Experiment Setup | Yes | Our models are trained for 5×10^5 steps, where a single step consists of one discriminator update followed by one generator update, each using a minibatch size of 1024, with batches sampled independently in each of these two updates. Both updates are computed using the Adam optimizer (Kingma & Ba, 2015) with β1 = 0 and β2 = 0.999, and a learning rate of 10^-3 with a cosine decay (Loshchilov & Hutter, 2017) schedule used such that the learning rate is 0 at step 500K. |
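The learning-rate schedule quoted above (base rate 10^-3, cosine decay reaching 0 at step 500K) can be sketched as a small function. This is a minimal illustration assuming the standard cosine-decay formulation of Loshchilov & Hutter (2017); the function name and defaults are hypothetical, not from the paper's code.

```python
import math

def cosine_decay_lr(step, base_lr=1e-3, total_steps=500_000):
    """Cosine-decayed learning rate: starts at base_lr and reaches 0 at total_steps."""
    step = min(step, total_steps)  # hold at 0 after decay completes
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

# At step 0 the rate is the base 1e-3; at the midpoint it is half that,
# and at step 500K it has decayed to 0.
```

Note that with β1 = 0, Adam keeps no momentum on the first-moment estimate, a common choice for GAN discriminator/generator updates.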