Neural Speech Synthesis with Transformer Network
Authors: Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu; pp. 6706-6713
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments are conducted to test the efficiency and performance of our new network. |
| Researcher Affiliation | Collaboration | Naihan Li (1,4), Shujie Liu (2), Yanqing Liu (3), Sheng Zhao (3), Ming Liu (1,4). (1) University of Electronic Science and Technology of China; (2) Microsoft Research Asia; (3) Microsoft STC Asia; (4) CETC Big Data Research Institute Co., Ltd, Guizhou, China |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper provides a link to audio samples, not the source code for the described methodology. 'Audio samples can be accessed on https://neuraltts.github.io/transformertts/' |
| Open Datasets | No | The paper states that an 'internal US English female dataset' was used, with no public access information provided. 'We use 4 Nvidia Tesla P100 to train our model with an internal US English female dataset, which contains 25-hour professional speech (17584 (text, wave) pairs, with a few too long waves removed).' |
| Dataset Splits | No | The paper mentions using a 'dynamic batch size' and 'on average 16 samples in single batch per GPU' but does not specify explicit training, validation, or test dataset splits (e.g., percentages or exact counts) for reproducibility. |
| Hardware Specification | Yes | We use 4 Nvidia Tesla P100 to train our model with an internal US English female dataset... |
| Software Dependencies | No | The paper mentions using Tacotron2 and WaveNet as components but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Therefore, we use the dynamic batch size where the maximum total number of mel spectrogram frames is fixed and one batch should contain as many samples as possible. Thus there are on average 16 samples in single batch per GPU. ... The sample rate of ground truth audios is 16000 and frame rate (frames per second) of ground truth mel spectrogram is 80. Our autoregressive WaveNet contains 2 QRNN layers and 20 dilated layers, and the sizes of all residual channels and dilation channels are all 256. |
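
The dynamic batch size quoted in the Experiment Setup row can be illustrated with a minimal sketch. This is not the authors' implementation, since the paper provides no code. The function name `dynamic_batches`, the `max_frames` budget, the length-sorting step, and the choice to count unpadded frames are all assumptions made for illustration. The only values taken from the paper are the 16000 sample rate and the frame rate of 80 frames per second, which together imply 200 audio samples per mel frame.

```python
from typing import List, Sequence

SAMPLE_RATE = 16000  # audio samples per second (from the paper)
FRAME_RATE = 80      # mel spectrogram frames per second (from the paper)
HOP_SAMPLES = SAMPLE_RATE // FRAME_RATE  # 200 audio samples per mel frame


def dynamic_batches(frame_counts: Sequence[int], max_frames: int) -> List[List[int]]:
    """Group sample indices so each batch holds at most `max_frames` mel frames.

    Hypothetical helper: sorting by length and counting unpadded frames are
    assumptions; the paper only states that the frame budget is fixed and
    that a batch should contain as many samples as possible.
    """
    # Visit utterances from shortest to longest so similar lengths share a batch.
    order = sorted(range(len(frame_counts)), key=lambda i: frame_counts[i])
    batches: List[List[int]] = []
    current: List[int] = []
    total = 0
    for i in order:
        n = frame_counts[i]
        if n > max_frames:
            raise ValueError(f"sample {i} has {n} frames, over the budget {max_frames}")
        # Close the current batch when adding this sample would exceed the budget.
        if current and total + n > max_frames:
            batches.append(current)
            current, total = [], 0
        current.append(i)
        total += n
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Five utterances with made-up frame counts and a made-up budget of 1000 frames.
    print(dynamic_batches([300, 500, 200, 400, 100], max_frames=1000))
    # → [[4, 2, 0, 3], [1]]
```

With a fixed frame budget, batches of short utterances contain more samples than batches of long ones, which is consistent with the paper reporting only an average of 16 samples per batch per GPU.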