Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
Authors: Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Ye Jia, Fei Ren, Rif A. Saurous
ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we propose global style tokens (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable labels they generate can be used to control synthesis in novel ways, such as varying speed and speaking style independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis. We train models using 147 hours of American English audiobook data. Table 1. SxS subjective preference (%) and p-values of GST audiobook synthesis against a Tacotron baseline. Table 2. Robust MOS as a function of the percentage of interference in the training set. Table 3. WER for the Spanish to English unsupervised language transfer experiment. To evaluate this result objectively, we compute word error rates (WER) of an English ASR model on the synthesized speech. We use t-SNE (Maaten & Hinton, 2008) to visualize the style embeddings learned from both the artificial noise and TED datasets. |
| Researcher Affiliation | Industry | Yuxuan Wang 1 Daisy Stanton 1 Yu Zhang 1 RJ Skerry-Ryan 1 Eric Battenberg 1 Joel Shor 1 Ying Xiao 1 Fei Ren 1 Ye Jia 1 Rif A. Saurous 1 1Google, Inc. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper provides a link to sound demos but not to source code for the described methodology, and it contains no explicit statement about code release. |
| Open Datasets | Yes | We train models using 147 hours of American English audiobook data. These are read by the 2013 Blizzard Challenge speaker, Catherine Byers, in an animated and emotive storytelling style. Our second experiment uses real data. This dataset is made up of audio tracks mined from 439 official TED YouTube channel videos. |
| Dataset Splits | No | The paper refers to an 'evaluation set' but does not provide the training/validation/test splits, percentages, or absolute sample counts needed for reproduction, and it does not clearly define a validation split. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU or CPU models, or cloud computing instance types. |
| Software Dependencies | No | The paper mentions components like GRU, LSTMs, Zoneout, and Griffin-Lim, but it does not specify any software dependencies with version numbers (e.g., Python 3.x, TensorFlow 1.x, PyTorch 1.x). |
| Experiment Setup | Yes | For our baseline and GST-augmented Tacotron systems, we use the same architecture and hyperparameters as (Wang et al., 2017a) except for a few details. We use phoneme inputs to speed up training, and slightly change the decoder, replacing GRU cells with two layers of 256-cell LSTMs; these are regularized using zoneout (Krueger et al., 2017) with probability 0.1. Our experiments use 10 tokens with dimension 256. When using h attention heads, we set the token embedding size to be 256/h and concatenate the attention outputs, such that the final style embedding size remains the same. The reference encoder is made up of a convolutional stack, followed by an RNN. It takes as input a log-mel spectrogram, which is first passed to a stack of six 2-D convolutional layers with 3×3 kernel, 2×2 stride, batch normalization and ReLU activation function. We use 32, 32, 64, 64, 128 and 128 output channels for the 6 convolutional layers, respectively. The resulting output tensor is then shaped back to 3 dimensions (preserving the output time resolution) and fed to a single-layer 128-unit unidirectional GRU. |
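
To make the architecture quoted in the Experiment Setup row concrete, the snippet below is a minimal PyTorch sketch of the reference encoder and the style-token attention layer. The conv stack (six 3×3, stride-2 layers with batch norm and ReLU, channels 32/32/64/64/128/128), the 128-unit GRU, the 10-token bank, and the 256/h per-head token size follow the quoted setup; the class names, the query/key projections, the 80-bin mel input, the 4-head default, and the token initialization are illustrative assumptions, not details confirmed by the paper.

```python
# Sketch of a GST-style reference encoder + style-token attention (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReferenceEncoder(nn.Module):
    """Six 3x3, stride-2 conv layers (batch norm + ReLU) over a log-mel
    spectrogram, followed by a single-layer 128-unit unidirectional GRU."""

    def __init__(self, n_mels=80, channels=(32, 32, 64, 64, 128, 128), gru_units=128):
        super().__init__()
        layers, in_ch = [], 1
        for out_ch in channels:
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(),
            ]
            in_ch = out_ch
        self.convs = nn.Sequential(*layers)
        # The mel axis is halved (rounding up) by each of the six strided convs.
        post_mels = n_mels
        for _ in channels:
            post_mels = (post_mels + 1) // 2
        self.gru = nn.GRU(channels[-1] * post_mels, gru_units, batch_first=True)

    def forward(self, mel):                          # mel: [B, T, n_mels]
        x = self.convs(mel.unsqueeze(1))             # [B, C, T', M']
        b, c, t, m = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * m)  # back to 3-D, time preserved
        _, h = self.gru(x)                           # final GRU state summarizes the clip
        return h.squeeze(0)                          # reference embedding: [B, gru_units]


class StyleTokenLayer(nn.Module):
    """Multi-head attention of the reference embedding over a bank of 10
    learned tokens; concatenated heads give a 256-dim style embedding."""

    def __init__(self, num_tokens=10, style_dim=256, num_heads=4, ref_dim=128):
        super().__init__()
        head_dim = style_dim // num_heads            # token embedding size = 256 / h
        self.tokens = nn.Parameter(torch.randn(num_tokens, head_dim) * 0.3)
        self.num_heads = num_heads
        self.query_proj = nn.Linear(ref_dim, style_dim)   # assumed projection
        self.key_proj = nn.Linear(head_dim, style_dim)    # assumed projection

    def forward(self, ref_embedding):                # [B, ref_dim]
        b = ref_embedding.size(0)
        tokens = torch.tanh(self.tokens)             # squashed token bank
        q = self.query_proj(ref_embedding)           # [B, style_dim]
        k = self.key_proj(tokens)                    # [N, style_dim]
        d = q.size(-1) // self.num_heads
        q = q.view(b, self.num_heads, 1, d)                          # [B, H, 1, d]
        k = k.view(-1, self.num_heads, d).permute(1, 0, 2)           # [H, N, d]
        v = tokens.unsqueeze(0).expand(self.num_heads, -1, -1)       # [H, N, d]
        scores = torch.einsum('bhqd,hnd->bhqn', q, k) / d ** 0.5
        weights = F.softmax(scores, dim=-1)          # soft, interpretable token weights
        style = torch.einsum('bhqn,hnd->bhqd', weights, v)
        return style.reshape(b, -1), weights         # [B, style_dim], [B, H, 1, N]


if __name__ == "__main__":
    mel = torch.randn(2, 400, 80)                    # two reference clips, 400 mel frames
    ref = ReferenceEncoder()(mel)
    style, weights = StyleTokenLayer()(ref)
    print(style.shape, weights.shape)                # [2, 256] and [2, 4, 1, 10]
```

Per the paper's described usage, the resulting style embedding conditions the Tacotron text encoder: at inference it can be computed from a reference clip (style transfer) or assembled directly from chosen token weights to control attributes such as speed and speaking style.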