SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech
Authors: Zhenhui Ye, Zhou Zhao, Yi Ren, Fei Wu
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on three datasets not only show that the tree-structured syntactic information grants SyntaSpeech the ability to synthesize better audio with expressive prosody, but also demonstrate the generalization ability of SyntaSpeech to adapt to multiple languages and multi-speaker text-to-speech. Ablation studies demonstrate the necessity of each component in SyntaSpeech. |
| Researcher Affiliation | Academia | Zhenhui Ye, Zhou Zhao, Yi Ren and Fei Wu, College of Computer Science and Technology, Zhejiang University. {zhenhuiye, zhaozhou, rayeren, wufei}@zju.edu.cn |
| Pseudocode | No | The paper describes the proposed model architecture and its components in text and diagrams (Figure 2), but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code and audio samples are available at https://syntaspeech.github.io. |
| Open Datasets | Yes | We evaluate SyntaSpeech on three datasets: 1) LJSpeech [Ito and Johnson, 2017], a single-speaker database which contains 13,100 English audio clips with a total of nearly 24 hours of speech; 2) Biaobei, a Chinese speech corpus consisting of 10,000 sentences (about 12 hours) from a Chinese speaker; 3) LibriTTS [Zen et al., 2019], an English dataset with 149,736 audio clips (about 245 hours) from 1,151 speakers (We only use train-clean-360 and train-clean-100). |
| Dataset Splits | No | We train SyntaSpeech on 1 Nvidia 2080Ti GPU with a batch size of 64 sentences. We conduct MOS (mean opinion score) and CMOS (comparative mean opinion score) evaluations on the test set via Amazon Mechanical Turk. The paper mentions using "train-clean-360 and train-clean-100" for LibriTTS, implying the standard LibriTTS training subsets, but does not give explicit train/validation/test splits (percentages or sample counts) for any of the datasets, nor does it describe a validation set. |
| Hardware Specification | Yes | We train SyntaSpeech on 1 Nvidia 2080Ti GPU with a batch size of 64 sentences. |
| Software Dependencies | No | The paper mentions optimizers (Adam) and vocoders (HiFi-GAN, Parallel WaveGAN) used, but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We train SyntaSpeech on 1 Nvidia 2080Ti GPU with a batch size of 64 sentences. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹ and follow the same learning rate schedule in [Vaswani et al., 2017]. It takes 320k steps for training until convergence. A sketch of this optimizer setup appears below the table. |
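
The learning rate schedule referenced in the Experiment Setup row is the inverse-square-root warmup from Vaswani et al. [2017]. Below is a minimal PyTorch sketch of that setup, assuming a model dimension and warmup step count (`d_model`, `warmup_steps`); neither value is stated in the paper or this report, and the stand-in model and loss are placeholders, not the SyntaSpeech architecture or its GAN objective.

```python
import torch

# Stand-in model; the actual SyntaSpeech architecture is not reproduced here.
model = torch.nn.Linear(256, 80)

# Adam hyperparameters quoted above: β1 = 0.9, β2 = 0.98, ϵ = 1e-9.
# Base lr is 1.0 so the schedule below fully determines the learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)

# Assumed values -- the paper only says it follows Vaswani et al. [2017].
d_model, warmup_steps = 256, 4000

def noam_lr(step: int) -> float:
    """Inverse-square-root schedule with linear warmup (Vaswani et al., 2017)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for step in range(320_000):          # "320k steps for training until convergence"
    x = torch.randn(64, 256)         # batch size of 64 sentences, per the paper
    loss = model(x).pow(2).mean()    # placeholder loss, not the paper's objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

With the base `lr=1.0`, the multiplicative factor returned by `noam_lr` is the learning rate itself, which keeps the whole schedule in one function rather than split between the optimizer and the scheduler.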