SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech
Authors: Zhenhui Ye, Zhou Zhao, Yi Ren, Fei Wu
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on three datasets not only show that the tree-structured syntactic information grants SyntaSpeech the ability to synthesize better audio with expressive prosody, but also demonstrate the generalization ability of SyntaSpeech to adapt to multiple languages and multi-speaker text-to-speech. Ablation studies demonstrate the necessity of each component in SyntaSpeech. |
| Researcher Affiliation | Academia | Zhenhui Ye, Zhou Zhao, Yi Ren and Fei Wu, College of Computer Science and Technology, Zhejiang University. {zhenhuiye, zhaozhou, rayeren, wufei}@zju.edu.cn |
| Pseudocode | No | The paper describes the proposed model architecture and its components in text and diagrams (Figure 2), but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code and audio samples are available at https://syntaspeech.github.io. |
| Open Datasets | Yes | We evaluate SyntaSpeech on three datasets: 1) LJSpeech [Ito and Johnson, 2017], a single-speaker database which contains 13,100 English audio clips with a total of nearly 24 hours of speech; 2) Biaobei, a Chinese speech corpus consisting of 10,000 sentences (about 12 hours) from a Chinese speaker; 3) LibriTTS [Zen et al., 2019], an English dataset with 149,736 audio clips (about 245 hours) from 1,151 speakers (We only use train-clean-360 and train-clean-100). |
| Dataset Splits | No | We train SyntaSpeech on 1 Nvidia 2080Ti GPU with a batch size of 64 sentences. We conduct MOS (mean opinion score) and CMOS (comparative mean opinion score) evaluations on the test set via Amazon Mechanical Turk. The paper mentions using "train-clean-360 and train-clean-100" for LibriTTS, implying the standard LibriTTS training subsets, but does not give explicit train/validation/test splits (percentages or sample counts) for any of the datasets, nor does it describe a validation set. |
| Hardware Specification | Yes | We train SyntaSpeech on 1 Nvidia 2080Ti GPU with a batch size of 64 sentences. |
| Software Dependencies | No | The paper mentions optimizers (Adam) and vocoders (HiFi-GAN, Parallel WaveGAN) used, but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We train SyntaSpeech on 1 Nvidia 2080Ti GPU with a batch size of 64 sentences. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹ and follow the same learning rate schedule in [Vaswani et al., 2017]. It takes 320k steps for training until convergence. A sketch of this optimizer setup appears below the table. |
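
The learning rate schedule referenced in the Experiment Setup row is the inverse-square-root warmup from Vaswani et al. [2017]. Below is a minimal PyTorch sketch of that setup, assuming a model dimension and warmup step count (`d_model`, `warmup_steps`); neither value is stated in the paper or this report, and the stand-in model and loss are placeholders, not the SyntaSpeech architecture or its GAN objective.

```python
import torch

# Stand-in model; the actual SyntaSpeech architecture is not reproduced here.
model = torch.nn.Linear(256, 80)

# Adam hyperparameters quoted above: β1 = 0.9, β2 = 0.98, ϵ = 1e-9.
# Base lr is 1.0 so the schedule below fully determines the learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)

# Assumed values -- the paper only says it follows Vaswani et al. [2017].
d_model, warmup_steps = 256, 4000

def noam_lr(step: int) -> float:
    """Inverse-square-root schedule with linear warmup (Vaswani et al., 2017)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for step in range(320_000):          # "320k steps for training until convergence"
    x = torch.randn(64, 256)         # batch size of 64 sentences, per the paper
    loss = model(x).pow(2).mean()    # placeholder loss, not the paper's objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

With the base `lr=1.0`, the multiplicative factor returned by `noam_lr` is the learning rate itself, which keeps the whole schedule in one function rather than split between the optimizer and the scheduler.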