Speech-T: Transducer for Text to Speech and Beyond

Authors: Jiawei Chen, Xu Tan, Yichong Leng, Jin Xu, Guihua Wen, Tao Qin, Tie-Yan Liu

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on LJSpeech datasets demonstrate that Speech-T 1) is more robust than the attention based autoregressive TTS model due to its inherent monotonic alignments between text and speech; 2) naturally supports streaming TTS with good voice quality; and 3) enjoys the benefit of joint modeling TTS and ASR in a single network.
Researcher Affiliation | Collaboration | Jiawei Chen, South China University of Technology (csjiaweichen@mail.scut.edu.cn); Xu Tan, Microsoft Research Asia (xuta@microsoft.com); Yichong Leng, University of Science and Technology of China (lyc123go@mail.ustc.edu.cn); Jin Xu, Tsinghua University (j-xu18@mails.tsinghua.edu.cn); Guihua Wen, South China University of Technology (crghwen@scut.edu.cn); Tao Qin, Microsoft Research Asia (taoqin@microsoft.com); Tie-Yan Liu, Microsoft Research Asia (tyliu@microsoft.com)
Pseudocode | No | The paper includes architectural diagrams (Figure 2, Figure 5) but no formal pseudocode or algorithm blocks.
Open Source Code | No | Footnote 2 states: "The audio samples generated by Speech-T and the baseline systems can be found in https://speechresearch.github.io/speechtransducer/." This link hosts audio samples, not source code for the method.
Open Datasets | Yes | Dataset: We conduct experiments on LJSpeech [10], a public speech dataset consisting of 13,100 English audio clips and corresponding text transcripts. The total length of the audio is approximately 24 hours. We randomly split the dataset into three parts: 12500 samples for training, 300 samples for validation and 300 samples for test.
Dataset Splits | Yes | We randomly split the dataset into three parts: 12500 samples for training, 300 samples for validation and 300 samples for test. (An illustrative split sketch is given after the table.)
Hardware Specification | Yes | We train our model on 8 Tesla V100 GPUs with a batch size of 6 sentences on each GPU.
Software Dependencies | No | The paper mentions using the Adam optimizer and Transformer TTS as a baseline, but does not provide version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | The number of self-attention heads is set to 2, the embedding size and hidden size are both 256, and the inner dimension of the feed-forward network is 1024. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ε = 10^-9 and follow the same learning rate schedule as in [29]. (A hedged sketch of this optimizer configuration follows the table.)
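
The split quoted under Dataset Splits is a plain random partition of the 13,100 LJSpeech clips into 12,500 / 300 / 300 samples. The paper does not release splitting code, so the sketch below only illustrates such a split; the function name, the seed, and splitting by clip ID are assumptions.

```python
# Illustrative random 12,500/300/300 split of LJSpeech clip ids.
# The seed and splitting by clip id are assumptions, not the authors' procedure.
import random

def split_ljspeech(clip_ids, seed=1234):
    """Shuffle the 13,100 LJSpeech clip ids and cut train/valid/test subsets."""
    ids = list(clip_ids)
    assert len(ids) == 13100, "LJSpeech contains 13,100 clips"
    random.Random(seed).shuffle(ids)
    return ids[:12500], ids[12500:12800], ids[12800:]
```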
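
The Experiment Setup row reports Adam with β1 = 0.9, β2 = 0.98, ε = 10^-9 and the learning rate schedule of [29], i.e. the inverse-square-root schedule with warmup from the Transformer paper. A minimal PyTorch sketch of that configuration is shown below; the placeholder model, the 4,000-step warmup, and the base learning rate of 1.0 are assumptions not reported in the excerpt.

```python
# Sketch of Adam (beta1=0.9, beta2=0.98, eps=1e-9) with the inverse-square-root
# warmup schedule of Vaswani et al. [29]. The model, warmup_steps, and base lr
# are placeholders for illustration only.
import torch

d_model, warmup_steps = 256, 4000  # hidden size from the paper; warmup assumed

model = torch.nn.Linear(d_model, d_model)  # stand-in for the Speech-T network
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)

def noam_lr(step):
    # lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

# Per training step: call optimizer.step() and then scheduler.step().
```

With a base learning rate of 1.0, LambdaLR scales the step size directly by the schedule value, matching the schedule as defined in [29].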