Speech-T: Transducer for Text to Speech and Beyond
Authors: Jiawei Chen, Xu Tan, Yichong Leng, Jin Xu, Guihua Wen, Tao Qin, Tie-Yan Liu
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on LJSpeech datasets demonstrate that Speech-T 1) is more robust than the attention based autoregressive TTS model due to its inherent monotonic alignments between text and speech; 2) naturally supports streaming TTS with good voice quality; and 3) enjoys the benefit of joint modeling TTS and ASR in a single network. |
| Researcher Affiliation | Collaboration | Jiawei Chen, South China University of Technology (csjiaweichen@mail.scut.edu.cn); Xu Tan, Microsoft Research Asia (xuta@microsoft.com); Yichong Leng, University of Science and Technology of China (lyc123go@mail.ustc.edu.cn); Jin Xu, Tsinghua University (j-xu18@mails.tsinghua.edu.cn); Guihua Wen, South China University of Technology (crghwen@scut.edu.cn); Tao Qin, Microsoft Research Asia (taoqin@microsoft.com); Tie-Yan Liu, Microsoft Research Asia (tyliu@microsoft.com) |
| Pseudocode | No | The paper includes architectural diagrams (Figure 2, Figure 5) but no formal pseudocode or algorithm blocks. |
| Open Source Code | No | Footnote 2 states: "The audio samples generated by Speech-T and the baseline systems can be found in https://speechresearch.github.io/speechtransducer/." This link is for audio samples, not source code for the methodology. |
| Open Datasets | Yes | Dataset We conduct experiments on LJSpeech [10], a public speech dataset consisting of 13,100 English audio clips and corresponding text transcripts. The total length of the audio is approximately 24 hours. We randomly split the dataset into three parts: 12500 samples for training, 300 samples for validation and 300 samples for test. |
| Dataset Splits | Yes | We randomly split the dataset into three parts: 12500 samples for training, 300 samples for validation and 300 samples for test. (A minimal split sketch follows this table.) |
| Hardware Specification | Yes | We train our model on 8 Tesla V100 GPUs with a batch size of 6 sentences on each GPU. |
| Software Dependencies | No | The paper mentions using Adam optimizer and Transformer TTS as a baseline, but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, etc.). |
| Experiment Setup | Yes | The number of self-attention head is set to 2, the dimension of embedding size and hidden size are both 256, and the inner dimension of feed-forward network is 1024. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ε = 10^-9 and follow the same learning rate schedule in [29]. (A sketch of this optimizer configuration follows this table.) |
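
As a companion to the Dataset Splits row, here is a minimal sketch of the reported 12500/300/300 random split of LJSpeech's 13,100 clips. The metadata path and the random seed are assumptions; the paper only states that the split was random.

```python
# Hedged sketch of the reported LJSpeech split (12500 train / 300 valid / 300 test).
# The metadata path and seed are assumptions, not taken from the paper.
import random

def split_ljspeech(metadata_path="LJSpeech-1.1/metadata.csv", seed=1234):
    with open(metadata_path, encoding="utf-8") as f:
        # Each LJSpeech metadata line: <clip id>|<raw text>|<normalized text>
        samples = [line.strip() for line in f if line.strip()]
    assert len(samples) == 13100, "expected the full LJSpeech-1.1 release"

    # Shuffle with a fixed seed so the split is reproducible, then slice.
    random.Random(seed).shuffle(samples)
    train = samples[:12500]
    valid = samples[12500:12800]
    test = samples[12800:]
    return train, valid, test
```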
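
For the Experiment Setup row, the following is a hedged sketch of the reported Adam configuration (β1 = 0.9, β2 = 0.98, ε = 10^-9) combined with the inverse-square-root warmup schedule of [29] ("Attention Is All You Need"). The use of PyTorch and the warmup_steps value are assumptions not stated in the paper; d_model = 256 matches the reported hidden size.

```python
# Hedged sketch: Adam with the paper's reported betas/eps plus the Noam-style
# learning-rate schedule from Vaswani et al. [29]. warmup_steps is an assumption.
import torch

def build_optimizer(model, d_model=256, warmup_steps=4000):
    # Base lr of 1.0 so LambdaLR's multiplier equals the effective learning rate.
    optimizer = torch.optim.Adam(
        model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)

    def noam_lr(step):
        # lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
        step = max(step, 1)
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
    return optimizer, scheduler
```

In this schedule the learning rate rises linearly for the first warmup_steps updates and then decays proportionally to the inverse square root of the step number, which is the behavior referenced by "the same learning rate schedule in [29]".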