High Fidelity Speech Synthesis with Adversarial Networks

Authors: Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, Karen Simonyan

ICLR 2020

Each entry below gives a reproducibility variable, the assessed result, and the LLM response supporting that assessment.
Research Type: Experimental. To measure the performance of GAN-TTS, we employ both subjective human evaluation (MOS, mean opinion score) and novel quantitative metrics (Fréchet DeepSpeech Distance and Kernel DeepSpeech Distance), which we find to be well correlated with MOS. We show that GAN-TTS is capable of generating high-fidelity speech with naturalness comparable to the state-of-the-art models, and unlike autoregressive models, it is highly parallelisable thanks to an efficient feed-forward generator.
Researcher Affiliation: Collaboration. Mikołaj Bińkowski, Department of Mathematics, Imperial College London (mikbinkowski@gmail.com); Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, Karen Simonyan, DeepMind ({jeffdonahue,sedielem,aidanclark,eriche,ncasagrande,luisca,simonyan}@google.com).
Pseudocode: Yes. Algorithm 1 in Appendix D shows pseudocode for computation of RWD. In Algorithm 1 we present the pseudocode for training GAN-TTS.
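
For intuition, here is a minimal Python sketch of the random-window idea behind the RWD ensemble referenced above: each discriminator scores a randomly cropped window of the waveform rather than the full clip. The window sizes and function names below are illustrative assumptions, not the paper's pseudocode or exact configuration.

    # Minimal sketch of random-window cropping for an RWD-style ensemble.
    # Window sizes are placeholders; conditional discriminators would also
    # need the matching slice of the conditioning signal (not shown here).
    import numpy as np

    WINDOW_SIZES = (240, 480, 960, 1920, 3600)  # samples; illustrative values

    def sample_random_window(waveform: np.ndarray, window_size: int,
                             rng: np.random.Generator) -> np.ndarray:
        """Crop a random window of `window_size` samples from a 1-D waveform."""
        start = rng.integers(0, len(waveform) - window_size + 1)
        return waveform[start:start + window_size]

    def random_windows(waveform: np.ndarray, rng: np.random.Generator) -> dict:
        """One random crop per discriminator window size."""
        return {w: sample_random_window(waveform, w, rng) for w in WINDOW_SIZES}
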
Open Source Code: Yes. We propose a family of quantitative metrics for speech generation based on Fréchet Inception Distance (FID, Heusel et al., 2017) and Kernel Inception Distance (KID, Bińkowski et al., 2018), where we replace the Inception image recognition network with the DeepSpeech audio recognition network. The code for our metrics is publicly available online [1]. [Footnote 1]: https://github.com/mbinkowski/DeepSpeechDistances
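
For concreteness, here is a minimal sketch of the two distances described above, assuming per-clip DeepSpeech features (shape: num_clips x feature_dim) have already been extracted. The authors' released implementation is in the repository linked in the footnote; this illustrative version is not a drop-in replacement for it.

    # Frechet- and kernel-style distances between real and generated feature sets.
    import numpy as np
    from scipy import linalg

    def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
        """FID-style distance between Gaussians fitted to the two feature sets."""
        mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
        cov_r = np.cov(real_feats, rowvar=False)
        cov_g = np.cov(gen_feats, rowvar=False)
        covmean = linalg.sqrtm(cov_r @ cov_g)   # matrix square root of the product
        if np.iscomplexobj(covmean):
            covmean = covmean.real              # drop numerical noise
        diff = mu_r - mu_g
        return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

    def kernel_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
        """KID-style unbiased MMD^2 estimate with the cubic polynomial kernel."""
        def k(x, y):
            return (x @ y.T / x.shape[1] + 1.0) ** 3
        m, n = len(real_feats), len(gen_feats)
        k_rr, k_gg, k_rg = k(real_feats, real_feats), k(gen_feats, gen_feats), k(real_feats, gen_feats)
        term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
        term_gg = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
        return float(term_rr + term_gg - 2.0 * k_rg.mean())

The paper also reports conditional variants (cFDSD, cKDSD) computed against reference audio for the same linguistic features; the sketch above covers only the unconditional comparison.
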
Open Datasets: No. The paper describes the characteristics of the dataset used for training, stating, 'Our text-to-speech models are trained on a dataset which contains high-fidelity audio of human speech with the corresponding linguistic features and pitch information.' However, it does not provide a name, link, or citation for public access to this specific dataset.
Dataset Splits: No. The paper does not explicitly specify distinct training, validation, and test dataset splits with percentages, sample counts, or references to predefined splits.
Hardware Specification: Yes. We trained our models on Cloud TPU v3 Pods with data parallelism over 128 replicas for 1 million generator and discriminator updates, which usually took up to 48 hours.
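
The replica count above refers to standard data-parallel training. As a rough illustration only (the authors' TPU training code is not public, and the TensorFlow 2 APIs below are an assumption about tooling, not a description of their setup), a TPUStrategy-based skeleton would look like this:

    # Illustrative data-parallel setup on a TPU pod slice; 'tpu-pod' is a
    # placeholder resolver name, not a real resource from the paper.
    import tensorflow as tf

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='tpu-pod')
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    GLOBAL_BATCH = 1024  # batch size quoted in the Experiment Setup entry below
    per_replica_batch = GLOBAL_BATCH // strategy.num_replicas_in_sync  # 8 with 128 replicas

    with strategy.scope():
        # Build generator, discriminators, and optimizers here so that their
        # variables are mirrored across all replicas.
        pass
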
Software Dependencies: No. The paper mentions using the 'pre-trained DeepSpeech2 model from the NVIDIA OpenSeq2Seq library (Kuchaiev et al., 2018)' but does not provide specific version numbers for OpenSeq2Seq or any other software libraries or frameworks used in their implementation.
Experiment Setup: Yes. We train all models with a single discriminator step per generator step, but with doubled learning rate: 10^-4 for the discriminator, compared to 5 × 10^-5 for the generator. We use the hinge loss (Lim & Ye, 2017), a batch size of 1024 and the Adam optimizer (Kingma & Ba, 2015) with hyperparameters β1 = 0, β2 = 0.999. Following Brock et al. (2019), we use spectral normalisation (Miyato et al., 2018) and orthogonal initialisation (Saxe et al., 2014) in both the generator and discriminator(s), and apply off-diagonal orthogonal regularisation (Brock et al., 2016; 2019) and exponential moving averaging to the generator weights with a decay rate of 0.9999 for sampling.
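
The quoted settings translate directly into optimiser and loss code. Below is a minimal PyTorch sketch of those settings (hinge losses, Adam with β1 = 0 and β2 = 0.999, learning rates 10^-4 and 5 × 10^-5, generator EMA with decay 0.9999). The module constructors are hypothetical and this is not the authors' (TPU-based) implementation.

    # Sketch of the GAN-TTS optimisation settings quoted above.
    import torch
    import torch.nn.functional as F

    def d_hinge_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
        """Hinge loss for the discriminator (Lim & Ye, 2017)."""
        return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

    def g_hinge_loss(d_fake: torch.Tensor) -> torch.Tensor:
        """Hinge loss for the generator."""
        return -d_fake.mean()

    def make_optimizers(generator, discriminator):
        """Adam with beta1=0, beta2=0.999; discriminator LR doubled vs. generator."""
        g_opt = torch.optim.Adam(generator.parameters(), lr=5e-5, betas=(0.0, 0.999))
        d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4, betas=(0.0, 0.999))
        return g_opt, d_opt

    @torch.no_grad()
    def update_ema(ema_generator, generator, decay: float = 0.9999):
        """Exponential moving average of generator weights, used for sampling."""
        for ema_p, p in zip(ema_generator.parameters(), generator.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

Spectral normalisation and orthogonal initialisation from the quoted setup would be applied when the modules are constructed (e.g. torch.nn.utils.spectral_norm and torch.nn.init.orthogonal_ in this framework); they are omitted above for brevity.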