Neural TTS Stylization with Adversarial and Collaborative Games
Authors: Shuang Ma, Daniel McDuff, Yale Song
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our model from three perspectives: content vs. style disentanglement ability (Sec. 5.1), effectiveness of style modeling (Sec. 5.2), and controllability (Sec. 5.3). We use two datasets: EMT-4, an in-house dataset of 22,377 American English audio-text samples... VCTK, a publicly available, multi-speaker dataset... |
| Researcher Affiliation | Collaboration | Shuang Ma, State University of New York at Buffalo, Buffalo, NY, shuangma@buffalo.edu; Daniel McDuff, Microsoft Research, Redmond, WA, damcduff@microsoft.com; Yale Song, Microsoft Cloud & AI, Redmond, WA, yalesong@microsoft.com |
| Pseudocode | No | The paper provides schematic diagrams of network architectures (Figure 1, 3, 4, 5) but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | Project webpage: https://researchdemopage.wixsite.com/tts-gan. This is a project demonstration page, not a direct link to the source code repository. |
| Open Datasets | No | We use two datasets: EMT-4, an in-house dataset... VCTK, a publicly available, multi-speaker dataset... The paper states VCTK is publicly available but does not provide a specific link, DOI, repository name, or formal citation with author names and year for access. |
| Dataset Splits | No | The paper mentions training steps and selecting samples from the test set for evaluation but does not specify explicit train/validation/test dataset splits, proportions, or sample counts. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU models, or cloud computing specifications. |
| Software Dependencies | No | The paper mentions software components like Tacotron, WaveNet, and the Griffin-Lim method, but it does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We set α = 0.1, β = 10 in our experiments. We train our model with a minibatch size of 32 using the Adam optimizer; we iterated 200K steps for EMT-4 and 280K steps for VCTK datasets. The six Conv2D layers have [32, 32, 64, 64, 128, 128] filters, respectively, each with a kernel size 3×3 and a stride of 2×2. Each layer is followed by a ReLU activation and batch normalization. |
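The convolutional stack described in the Experiment Setup row can be sketched as follows. The paper does not name a framework, padding scheme, or input channel count, so this PyTorch sketch assumes `padding=1`, a single-channel spectrogram input, and the stated minibatch size of 32; the class name `ConvEncoder` is hypothetical.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Sketch of the paper's conv stack: six Conv2D layers with
    [32, 32, 64, 64, 128, 128] filters, 3x3 kernels, stride 2,
    each followed by ReLU and batch normalization.
    Padding and input channels are assumptions, not from the paper."""

    def __init__(self, in_channels: int = 1):
        super().__init__()
        layers = []
        channels = in_channels
        for out_channels in [32, 32, 64, 64, 128, 128]:
            layers += [
                nn.Conv2d(channels, out_channels,
                          kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.BatchNorm2d(out_channels),
            ]
            channels = out_channels
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

encoder = ConvEncoder()
# Adam optimizer and minibatch size 32, as stated in the paper;
# the 128x128 spectrogram size is an illustrative assumption.
optimizer = torch.optim.Adam(encoder.parameters())
spectrograms = torch.randn(32, 1, 128, 128)
features = encoder(spectrograms)
```

With stride 2 at each of the six layers, the spatial resolution is halved six times (128 → 2 under the assumed input size), yielding a compact feature map for the downstream style/content branches.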