End-to-end Adversarial Text-to-Speech
Authors: Jeff Donahue, Sander Dieleman, Mikołaj Bińkowski, Erich Elsen, Karen Simonyan
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we discuss the setup and results of our empirical evaluation, describing the hyperparameter settings used for training and validating the architectural decisions and loss function components detailed in Section 2. Our primary metric used to evaluate speech quality is the Mean Opinion Score (MOS) given by human raters, computed by taking the mean of 1-5 naturalness ratings given across 1000 held-out conditioning sequences. |
| Researcher Affiliation | Industry | Jeff Donahue, Sander Dieleman, Mikołaj Bińkowski, Erich Elsen, Karen Simonyan, DeepMind {jeffdonahue,sedielem,binek,eriche,simonyan}@google.com |
| Pseudocode | Yes | B ALIGNER PSEUDOCODE: In Figure 3 we present pseudocode for the EATS aligner described in Section 2.1. |
| Open Source Code | No | Models were implemented using the TensorFlow (Abadi et al., 2015) v1 framework and the Sonnet (Reynolds et al., 2017) neural network library. We used the TF-Replicator (Buchlovsky et al., 2019) library for data parallel training over TPUs. Samples from each ablation are available at https://deepmind.com/research/publications/End-to-End-Adversarial-Text-to-Speech. |
| Open Datasets | No | We train all models on a private dataset that consists of high-quality recordings of human speech performed by professional voice actors, and corresponding text. |
| Dataset Splits | Yes | Our primary metric used to evaluate speech quality is the Mean Opinion Score (MOS) given by human raters, computed by taking the mean of 1-5 naturalness ratings given across 1000 held-out conditioning sequences. FDSD scores presented here were computed on held-out validation multi-speaker set... we used 5,120 samples for FDSD and 1,000 for MOS. |
| Hardware Specification | Yes | In Table 3 we report benchmarks for EATS batched inference on modern hardware platforms (Google Cloud TPU v3, NVIDIA V100 GPU, Intel Xeon E5-1650 CPU). |
| Software Dependencies | Yes | Models were implemented using the TensorFlow (Abadi et al., 2015) v1 framework and the Sonnet (Reynolds et al., 2017) neural network library. We used the TF-Replicator (Buchlovsky et al., 2019) library for data parallel training over TPUs. |
| Experiment Setup | Yes | Our models are trained for 5×10^5 steps, where a single step consists of one discriminator update followed by one generator update, each using a minibatch size of 1024, with batches sampled independently in each of these two updates. Both updates are computed using the Adam optimizer (Kingma & Ba, 2015) with β1 = 0 and β2 = 0.999, and a learning rate of 10^-3 with a cosine decay (Loshchilov & Hutter, 2017) schedule used such that the learning rate is 0 at step 500K. |
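The learning-rate schedule quoted above (base rate 10^-3, cosine decay reaching 0 at step 500K) can be sketched as a small function. This is a minimal illustration assuming the standard cosine-decay formulation of Loshchilov & Hutter (2017); the function name and defaults are hypothetical, not from the paper's code.

```python
import math

def cosine_decay_lr(step, base_lr=1e-3, total_steps=500_000):
    """Cosine-decayed learning rate: starts at base_lr and reaches 0 at total_steps."""
    step = min(step, total_steps)  # hold at 0 after decay completes
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

# At step 0 the rate is the base 1e-3; at the midpoint it is half that,
# and at step 500K it has decayed to 0.
```

Note that with β1 = 0, Adam keeps no momentum on the first-moment estimate, a common choice for GAN discriminator/generator updates.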