Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
Authors: RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron Weiss, Rob Clark, Rif A. Saurous
ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task. |
| Researcher Affiliation | Industry | Google, Inc. Correspondence to: RJ Skerry-Ryan <rjryan@google.com>. |
| Pseudocode | No | The paper describes the model architecture and components using text and diagrams (Figure 1, Figure 2, Figure 3), but it does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "Sound demos are available at https://google.github.io/tacotron/publications/end_to_end_prosody_transfer." This link points to a demo page with audio samples, not to the source code for the methodology itself. |
| Open Datasets | Yes | Single-speaker dataset: a single-speaker, high-quality English dataset of audiobook recordings by Catherine Byers (the speaker from the 2013 Blizzard Challenge). |
| Dataset Splits | No | The paper provides details on training steps and learning rate decay, but it does not specify any training/validation/test dataset splits or validation set sizes for the main experiments. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions several software components like "Tacotron" and "Adam optimizer" but does not specify their version numbers (e.g., TensorFlow version, Python version, specific library versions). |
| Experiment Setup | Yes | We train our models for at least 200k steps with a mini-batch size of 256 using the Adam optimizer (Kingma & Ba, 2015). We start with a learning rate of 1×10⁻³ and decay it to 5×10⁻⁴, 3×10⁻⁴, 1×10⁻⁴, and 5×10⁻⁵ at step 50k, 100k, 150k, and 200k respectively. |
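
The experiment-setup row describes a piecewise-constant learning-rate decay. The sketch below illustrates that schedule in plain Python; the function name `learning_rate` and the stand-alone implementation are assumptions for illustration only, since the paper does not publish its training code or framework details.

```python
# A minimal sketch of the piecewise-constant learning-rate schedule reported
# in the paper (Adam optimizer, mini-batch size 256). The structure of this
# helper is an assumption; only the step boundaries and rates come from the paper.

def learning_rate(step: int) -> float:
    """Return the learning rate in effect at a given global training step."""
    # (step boundary, rate) pairs: the rate applies to steps below the boundary.
    schedule = [
        (50_000, 1e-3),   # before 50k steps: 1e-3
        (100_000, 5e-4),  # 50k-100k steps:  5e-4
        (150_000, 3e-4),  # 100k-150k steps: 3e-4
        (200_000, 1e-4),  # 150k-200k steps: 1e-4
    ]
    for boundary, rate in schedule:
        if step < boundary:
            return rate
    return 5e-5           # 200k steps and beyond: 5e-5

if __name__ == "__main__":
    # Query the schedule at a few checkpoints to verify the decay points.
    for s in (0, 60_000, 120_000, 180_000, 250_000):
        print(s, learning_rate(s))
```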