RWEN-TTS: Relation-Aware Word Encoding Network for Natural Text-to-Speech Synthesis
Authors: Shinhyeok Oh, HyeongRae Noh, Yoonseok Hong, Insoo Oh
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show substantial improvements compared to previous works. From the Experimental Setup (Datasets): We train and evaluate RWEN on LJSpeech (Ito and Johnson 2017), a single speaker corpus recorded by a female English speaker. |
| Researcher Affiliation | Industry | Shinhyeok Oh*, Hyeong Rae Noh*, Yoonseok Hong, and Insoo Oh, Netmarble AI Center, {kai, hr_noh, yhong, ioh}@netmarble.com |
| Pseudocode | No | The paper contains architectural diagrams (Figures 2, 3, 4) but no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | More details and samples are in our repository (footnote 7) and demonstration site (footnote 8), with footnote 7: https://github.com/shinhyeokoh/rwen |
| Open Datasets | Yes | Datasets We train and evaluate RWEN on LJSpeech (Ito and Johnson 2017), a single speaker corpus recorded by a female English speaker. |
| Dataset Splits | Yes | It consists of 13,100 short audio clips with a total length of 24 hours, being randomly split into 12,500, 100, and 500 samples to comprise the training, validation, and test datasets as in Kim, Kong, and Son (2021). |
| Hardware Specification | Yes | We use mixed precision training on 16 Tesla A100 GPUs for all the experiments. |
| Software Dependencies | No | We implemented our proposed model, called RWEN, using the PyTorch (Paszke et al. 2019) and Transformers (footnote 6) (Wolf et al. 2020) libraries. This names the libraries but not their specific version numbers. |
| Experiment Setup | Yes | Specifically, Phoneme Encoder and Mel-spectrogram Decoder are composed of four Feed-Forward Transformer (FFT) blocks (Ren et al. 2019) whose parameters are the same as described in Ren et al. (2021) except that the hidden size of the Mel-spectrogram decoder is 1024. Duration Predictor, Pitch Predictor, and Energy Predictor are the same architecture: two 1-D convolutions with kernel size 3 and 256/256 input/output channels, each followed by ReLU, LayerNorm, and Dropout with the probability of 0.1 (see the sketch below the table). ... The batch size is set to 2 per GPU, and the model is trained up to 200k steps. |
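
The predictor description in the Experiment Setup row maps directly onto a small PyTorch module. The sketch below is an illustration under assumptions, not the authors' released code: it stacks two 1-D convolutions (kernel size 3, 256/256 channels), each followed by ReLU, LayerNorm, and Dropout with probability 0.1. The final linear projection to one value per frame and the class name `VariancePredictorSketch` are assumptions in the style of FastSpeech 2 variance predictors and are not stated in the quoted text.

```python
import torch
import torch.nn as nn


class VariancePredictorSketch(nn.Module):
    """Minimal sketch of the Duration/Pitch/Energy predictor layout quoted above."""

    def __init__(self, hidden: int = 256, kernel_size: int = 3, dropout: float = 0.1):
        super().__init__()
        padding = (kernel_size - 1) // 2  # "same" padding for odd kernel sizes
        self.conv1 = nn.Conv1d(hidden, hidden, kernel_size, padding=padding)
        self.norm1 = nn.LayerNorm(hidden)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size, padding=padding)
        self.norm2 = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)
        # Assumed scalar output per frame (FastSpeech 2-style); not in the quote.
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden); Conv1d expects (batch, hidden, time)
        y = self.conv1(x.transpose(1, 2)).transpose(1, 2)
        y = self.dropout(self.norm1(torch.relu(y)))   # conv -> ReLU -> LayerNorm -> Dropout
        y = self.conv2(y.transpose(1, 2)).transpose(1, 2)
        y = self.dropout(self.norm2(torch.relu(y)))
        return self.proj(y).squeeze(-1)               # (batch, time)


# Example: a batch of 2 sequences, 50 frames, hidden size 256
out = VariancePredictorSketch()(torch.randn(2, 50, 256))
print(out.shape)  # torch.Size([2, 50])
```
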