Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Speech-T: Transducer for Text to Speech and Beyond

Authors: Jiawei Chen, Xu Tan, Yichong Leng, Jin Xu, Guihua Wen, Tao Qin, Tie-Yan Liu

NeurIPS 2021 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on LJSpeech datasets demonstrate that Speech-T 1) is more robust than the attention based autoregressive TTS model due to its inherent monotonic alignments between text and speech; 2) naturally supports streaming TTS with good voice quality; and 3) enjoys the benefit of joint modeling TTS and ASR in a single network.
Researcher Affiliation | Collaboration | Jiawei Chen (South China University of Technology); Xu Tan (Microsoft Research Asia); Yichong Leng (University of Science and Technology of China); Jin Xu (Tsinghua University); Guihua Wen (South China University of Technology); Tao Qin (Microsoft Research Asia); Tie-Yan Liu (Microsoft Research Asia)
Pseudocode | No | The paper includes architectural diagrams (Figure 2, Figure 5) but no formal pseudocode or algorithm blocks.
Open Source Code | No | Footnote 2 states: "The audio samples generated by Speech-T and the baseline systems can be found in https://speechresearch.github.io/speechtransducer/." This link is for audio samples, not source code for the methodology.
Open Datasets | Yes | We conduct experiments on LJSpeech [10], a public speech dataset consisting of 13,100 English audio clips and corresponding text transcripts. The total length of the audio is approximately 24 hours. We randomly split the dataset into three parts: 12500 samples for training, 300 samples for validation and 300 samples for test.
Dataset Splits | Yes | We randomly split the dataset into three parts: 12500 samples for training, 300 samples for validation and 300 samples for test.
Hardware Specification | Yes | We train our model on 8 Tesla V100 GPUs with a batch size of 6 sentences on each GPU.
Software Dependencies | No | The paper mentions using the Adam optimizer and Transformer TTS as a baseline, but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | The number of self-attention heads is set to 2, the dimension of embedding size and hidden size are both 256, and the inner dimension of the feed-forward network is 1024. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ε = 10^-9 and follow the same learning rate schedule in [29].
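The 12500/300/300 split reported for Dataset Splits can be expressed in a few lines. A minimal sketch, assuming a shuffled random partition with a fixed seed (the paper does not state its seed or splitting tool):

```python
import random

def split_ljspeech(n_total=13100, n_train=12500, n_val=300, n_test=300, seed=0):
    # Shuffle sample indices and partition them into train/val/test,
    # mirroring the 12500/300/300 split described in the paper.
    assert n_train + n_val + n_test == n_total
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])

train_idx, val_idx, test_idx = split_ljspeech()
```

The three index lists are disjoint and together cover all 13,100 clips; any fixed seed would do, since the paper only specifies the split sizes.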
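The Adam settings in the Experiment Setup row (β1 = 0.9, β2 = 0.98, ε = 10^-9) match the original Transformer recipe, so the schedule cited as [29] is presumably the inverse-square-root warmup schedule from the Transformer paper. A minimal sketch under that assumption; the warmup length of 4000 steps is itself an assumption, not stated in this report:

```python
def transformer_lr(step, d_model=256, warmup_steps=4000):
    # Inverse-square-root schedule with linear warmup:
    #   lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    # The LR rises linearly for warmup_steps, peaks, then decays as step^-0.5.
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

In a framework such as PyTorch, this factor would typically be applied via a LambdaLR-style scheduler wrapped around `Adam(betas=(0.9, 0.98), eps=1e-9)`.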