Almost Unsupervised Text to Speech and Automatic Speech Recognition
Authors: Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on the LJSpeech dataset by leveraging only 200 paired speech and text data and extra unpaired data. First, our proposed method can generate intelligible voice with a word level intelligible rate of 99.84%, compared with nearly 0 intelligible rate if training on only 200 paired data. Second, our method can achieve 2.68 MOS for TTS and 11.7% PER for ASR, outperforming the baseline model trained on only 200 paired data. |
| Researcher Affiliation | Collaboration | Yi Ren¹, Xu Tan², Tao Qin², Sheng Zhao³, Zhou Zhao¹, Tie-Yan Liu² (Yi Ren and Xu Tan contributed equally). This work was conducted in Microsoft Research Asia. ¹Zhejiang University, ²Microsoft Research, ³Microsoft STC Asia. |
| Pseudocode | No | The paper describes the model architecture and training flow using figures and mathematical equations but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | Audio samples can be accessed on https://speechresearch.github.io/unsuper/ and we will release the codes soon. |
| Open Datasets | Yes | We choose the speech and text data from the LJSpeech dataset (Ito, 2017) for training. LJSpeech contains 13,100 English audio clips and the corresponding transcripts. The total length of the audio is approximately 24 hours. |
| Dataset Splits | Yes | We randomly split the dataset into 3 sets: 12,500 samples in the training set, 300 samples in the validation set, and 300 samples in the test set. |
| Hardware Specification | Yes | We train the Transformer model on 4 NVIDIA P100 GPUs. |
| Software Dependencies | No | The paper mentions using Adam optimizer and following the learning rate schedule from Vaswani et al. (2017) but does not specify versions for software libraries or dependencies like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | The batch size is 512 sequences in total: 128 sequences for the denoising auto-encoder loss (as shown in Equation 6, each loss term with 32 sequences), 256 sequences for dual transformation (as shown in Equation 7, each loss term with 32 sequences), and 128 sequences from the limited paired data (as shown in Equation 8, each loss term with 32 sequences). When training with the denoising auto-encoder loss, we simply mask the elements in the speech and text sequences with a probability of 0.3, as in the corrupt operation described in Section 3.1. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ε = 10⁻⁹ and follow the same learning rate schedule as in Vaswani et al. (2017). |
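The Dataset Splits row reports a random 12,500 / 300 / 300 partition of the 13,100 LJSpeech clips. A minimal sketch of such a split is shown below; the clip identifiers and the random seed are illustrative assumptions, since the paper does not release its split or report a seed.

```python
import random

# Hypothetical clip identifiers standing in for the 13,100 LJSpeech (audio, transcript) pairs.
clip_ids = [f"clip_{i:05d}" for i in range(13100)]

random.seed(1234)          # seed is an assumption; the paper does not report one
random.shuffle(clip_ids)   # "randomly split the dataset"

train_ids = clip_ids[:12500]        # 12,500 training samples
valid_ids = clip_ids[12500:12800]   # 300 validation samples
test_ids  = clip_ids[12800:]        # 300 test samples
```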
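The Experiment Setup row quotes the Adam hyperparameters (β1 = 0.9, β2 = 0.98, ε = 10⁻⁹), the learning rate schedule of Vaswani et al. (2017), and the 0.3 masking probability of the corrupt operation. The sketch below illustrates these three pieces; the use of PyTorch, the `d_model` and warm-up values, and the placeholder model are assumptions, not details confirmed by the paper.

```python
import torch

# Placeholder module; the paper's Transformer-based TTS/ASR model is not yet released.
model = torch.nn.Linear(80, 80)

# Adam hyperparameters quoted in the Experiment Setup row.
# Base lr is 1.0 so the LambdaLR factor below becomes the actual learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)

def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Schedule from Vaswani et al. (2017): d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).
    d_model and warmup_steps are assumed values, not reported in the table above."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=transformer_lr)

def corrupt(mel, mask_prob=0.3):
    """Mask time steps of a (time, n_mels) mel-spectrogram with probability 0.3,
    loosely following the corrupt operation used for the denoising auto-encoder loss."""
    keep = (torch.rand(mel.shape[0], 1) >= mask_prob).to(mel.dtype)
    return mel * keep
```

In a training loop one would call `optimizer.step()` followed by `scheduler.step()` once per update, and apply `corrupt` to the speech sequences (and an analogous token-masking function to the text sequences) before the denoising auto-encoder forward pass.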