Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Speech-T: Transducer for Text to Speech and Beyond
Authors: Jiawei Chen, Xu Tan, Yichong Leng, Jin Xu, Guihua Wen, Tao Qin, Tie-Yan Liu
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the LJSpeech dataset demonstrate that Speech-T 1) is more robust than the attention-based autoregressive TTS model due to its inherent monotonic alignments between text and speech; 2) naturally supports streaming TTS with good voice quality; and 3) enjoys the benefit of jointly modeling TTS and ASR in a single network. |
| Researcher Affiliation | Collaboration | Jiawei Chen (South China University of Technology) EMAIL; Xu Tan (Microsoft Research Asia) EMAIL; Yichong Leng (University of Science and Technology of China) EMAIL; Jin Xu (Tsinghua University) EMAIL; Guihua Wen (South China University of Technology) EMAIL; Tao Qin (Microsoft Research Asia) EMAIL; Tie-Yan Liu (Microsoft Research Asia) EMAIL |
| Pseudocode | No | The paper includes architectural diagrams (Figure 2, Figure 5) but no formal pseudocode or algorithm blocks. |
| Open Source Code | No | Footnote 2 states: "The audio samples generated by Speech-T and the baseline systems can be found in https://speechresearch.github.io/speechtransducer/." This link is for audio samples, not source code for the methodology. |
| Open Datasets | Yes | Dataset: We conduct experiments on LJSpeech [10], a public speech dataset consisting of 13,100 English audio clips and corresponding text transcripts. The total length of the audio is approximately 24 hours. We randomly split the dataset into three parts: 12,500 samples for training, 300 samples for validation, and 300 samples for test. |
| Dataset Splits | Yes | We randomly split the dataset into three parts: 12,500 samples for training, 300 samples for validation, and 300 samples for test. |
| Hardware Specification | Yes | We train our model on 8 Tesla V100 GPUs with a batch size of 6 sentences on each GPU. |
| Software Dependencies | No | The paper mentions using Adam optimizer and Transformer TTS as a baseline, but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, etc.). |
| Experiment Setup | Yes | The number of self-attention heads is set to 2, the dimensions of the embedding size and hidden size are both 256, and the inner dimension of the feed-forward network is 1024. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ε = 10^-9 and follow the same learning rate schedule in [29]. |
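The dataset split and learning-rate schedule reported above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the `warmup_steps` value and the inverse-square-root formula are assumptions based on reference [29] (the Transformer learning-rate schedule), which the paper says it follows; the paper itself does not state the warm-up length.

```python
import random

def split_ljspeech(n_total=13100, n_val=300, n_test=300, seed=0):
    """Randomly split LJSpeech clip indices into train/validation/test,
    matching the 12,500 / 300 / 300 split reported in the paper."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]  # remaining 12,500 samples
    return train, val, test

def transformer_lr(step, d_model=256, warmup_steps=4000):
    """Inverse-square-root schedule from the Transformer paper:
    lr rises linearly during warm-up, then decays as step^-0.5.
    warmup_steps=4000 is an assumed value, not stated in the paper."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

train, val, test = split_ljspeech()
```

The Adam hyperparameters quoted in the table (β1 = 0.9, β2 = 0.98, ε = 10^-9) would then be passed to the optimizer alongside this schedule.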