Speech-T: Transducer for Text to Speech and Beyond

Authors: Jiawei Chen, Xu Tan, Yichong Leng, Jin Xu, Guihua Wen, Tao Qin, Tie-Yan Liu

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on LJSpeech datasets demonstrate that Speech-T 1) is more robust than the attention based autoregressive TTS model due to its inherent monotonic alignments between text and speech; 2) naturally supports streaming TTS with good voice quality; and 3) enjoys the benefit of joint modeling TTS and ASR in a single network.
Researcher Affiliation | Collaboration | Jiawei Chen, South China University of Technology (csjiaweichen@mail.scut.edu.cn); Xu Tan, Microsoft Research Asia (xuta@microsoft.com); Yichong Leng, University of Science and Technology of China (lyc123go@mail.ustc.edu.cn); Jin Xu, Tsinghua University (j-xu18@mails.tsinghua.edu.cn); Guihua Wen, South China University of Technology (crghwen@scut.edu.cn); Tao Qin, Microsoft Research Asia (taoqin@microsoft.com); Tie-Yan Liu, Microsoft Research Asia (tyliu@microsoft.com)
Pseudocode | No | The paper includes architectural diagrams (Figure 2, Figure 5) but no formal pseudocode or algorithm blocks.
Open Source Code | No | Footnote 2 states: "The audio samples generated by Speech-T and the baseline systems can be found in https://speechresearch.github.io/speechtransducer/." This link hosts audio samples, not source code for the method.
Open Datasets | Yes | Dataset: We conduct experiments on LJSpeech [10], a public speech dataset consisting of 13,100 English audio clips and corresponding text transcripts. The total length of the audio is approximately 24 hours. We randomly split the dataset into three parts: 12500 samples for training, 300 samples for validation and 300 samples for test.
Dataset Splits | Yes | We randomly split the dataset into three parts: 12500 samples for training, 300 samples for validation and 300 samples for test. (An illustrative split sketch is given after the table.)
Hardware Specification | Yes | We train our model on 8 Tesla V100 GPUs with a batch size of 6 sentences on each GPU.
Software Dependencies | No | The paper mentions using the Adam optimizer and Transformer TTS as a baseline, but does not provide version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | The number of self-attention heads is set to 2, the embedding size and hidden size are both 256, and the inner dimension of the feed-forward network is 1024. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ε = 10^-9 and follow the same learning rate schedule as in [29]. (A hedged sketch of this optimizer configuration follows the table.)
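
The split quoted under Dataset Splits is a plain random partition of the 13,100 LJSpeech clips into 12,500 / 300 / 300 samples. The paper does not release splitting code, so the sketch below only illustrates such a split; the function name, the seed, and splitting by clip ID are assumptions.

```python
# Illustrative random 12,500/300/300 split of LJSpeech clip ids.
# The seed and splitting by clip id are assumptions, not the authors' procedure.
import random

def split_ljspeech(clip_ids, seed=1234):
    """Shuffle the 13,100 LJSpeech clip ids and cut train/valid/test subsets."""
    ids = list(clip_ids)
    assert len(ids) == 13100, "LJSpeech contains 13,100 clips"
    random.Random(seed).shuffle(ids)
    return ids[:12500], ids[12500:12800], ids[12800:]
```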
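
The Experiment Setup row reports Adam with β1 = 0.9, β2 = 0.98, ε = 10^-9 and the learning rate schedule of [29], i.e. the inverse-square-root schedule with warmup from the Transformer paper. A minimal PyTorch sketch of that configuration is shown below; the placeholder model, the 4,000-step warmup, and the base learning rate of 1.0 are assumptions not reported in the excerpt.

```python
# Sketch of Adam (beta1=0.9, beta2=0.98, eps=1e-9) with the inverse-square-root
# warmup schedule of Vaswani et al. [29]. The model, warmup_steps, and base lr
# are placeholders for illustration only.
import torch

d_model, warmup_steps = 256, 4000  # hidden size from the paper; warmup assumed

model = torch.nn.Linear(d_model, d_model)  # stand-in for the Speech-T network
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)

def noam_lr(step):
    # lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

# Per training step: call optimizer.step() and then scheduler.step().
```

With a base learning rate of 1.0, LambdaLR scales the step size directly by the schedule value, matching the schedule as defined in [29].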