Neural TTS Stylization with Adversarial and Collaborative Games

Authors: Shuang Ma, Daniel McDuff, Yale Song

ICLR 2019

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | "We evaluate our model from three perspectives: content vs. style disentanglement ability (Sec. 5.1), effectiveness of style modeling (Sec. 5.2), and controllability (Sec. 5.3). We use two datasets: EMT-4, an in-house dataset of 22,377 American English audio-text samples... VCTK, a publicly available, multi-speaker dataset..."

Researcher Affiliation | Collaboration | Shuang Ma, State University of New York at Buffalo, Buffalo, NY, shuangma@buffalo.edu; Daniel McDuff, Microsoft Research, Redmond, WA, damcduff@microsoft.com; Yale Song, Microsoft Cloud & AI, Redmond, WA, yalesong@microsoft.com

Pseudocode | No | The paper provides schematic diagrams of the network architectures (Figures 1, 3, 4, 5) but does not include any pseudocode or algorithm blocks.

Open Source Code | No | Project webpage: https://researchdemopage.wixsite.com/tts-gan. This is a project demonstration page, not a direct link to a source code repository.

Open Datasets | No | "We use two datasets: EMT-4, an in-house dataset... VCTK, a publicly available, multi-speaker dataset..." The paper states that VCTK is publicly available but provides no link, DOI, repository name, or formal citation (author names and year) for access.

Dataset Splits | No | The paper mentions training steps and selecting samples from the test set for evaluation, but does not specify explicit train/validation/test splits, split proportions, or sample counts.

Hardware Specification | No | The paper does not describe the hardware used to run the experiments, such as GPU models, CPU models, or cloud computing specifications.

Software Dependencies | No | The paper mentions software components such as Tacotron, WaveNet, and the Griffin-Lim method, but does not provide version numbers for these or any other software dependencies.

Experiment Setup | Yes | "We set α = 0.1, β = 10 in our experiments. We train our model with a minibatch size of 32 using the Adam optimizer; we iterated 200K steps for EMT-4 and 280K steps for VCTK datasets. The six Conv2D layers have [32, 32, 64, 64, 128, 128] filters, respectively, each with a kernel size 3×3 and a stride of 2×2. Each layer is followed by a ReLU activation and batch normalization."
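The quoted setup is concrete enough to sketch in code. Below is a minimal PyTorch sketch of the described six-layer Conv2D stack, not the authors' implementation: the filter counts [32, 32, 64, 64, 128, 128], 3×3 kernels, stride 2, and ReLU followed by batch normalization come from the paper's text, while the input channel count (1, e.g. for a spectrogram) and the padding of 1 are assumptions made here.

```python
import torch
import torch.nn as nn


class ConvEncoder(nn.Module):
    """Sketch of the paper's six-layer Conv2D stack (hypothetical reconstruction)."""

    def __init__(self, in_channels: int = 1):  # in_channels=1 is an assumption
        super().__init__()
        filters = [32, 32, 64, 64, 128, 128]  # per-layer filter counts from the paper
        layers = []
        prev = in_channels
        for f in filters:
            layers += [
                # 3x3 kernel, stride 2 as stated; padding=1 is assumed
                nn.Conv2d(prev, f, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),            # "followed by a ReLU activation..."
                nn.BatchNorm2d(f),    # "...and batch normalization"
            ]
            prev = f
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# Each stride-2 layer roughly halves the spatial resolution, so a
# 1 x 1 x 128 x 80 input comes out as 1 x 128 x 2 x 2 after six layers.
encoder = ConvEncoder()
optimizer = torch.optim.Adam(encoder.parameters())  # Adam, minibatch size 32 per the paper
```

With this layout, training would simply iterate Adam steps over minibatches of 32 samples (200K steps for EMT-4, 280K for VCTK, per the quoted setup).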