RWEN-TTS: Relation-Aware Word Encoding Network for Natural Text-to-Speech Synthesis
Authors: Shinhyeok Oh, HyeongRae Noh, Yoonseok Hong, Insoo Oh
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show substantial improvements compared to previous works. From the Experimental Setup (Datasets): We train and evaluate RWEN on LJSpeech (Ito and Johnson 2017), a single speaker corpus recorded by a female English speaker. |
| Researcher Affiliation | Industry | Shinhyeok Oh*, Hyeong Rae Noh*, Yoonseok Hong, and Insoo Oh, Netmarble AI Center, {kai, hr_noh, yhong, ioh}@netmarble.com |
| Pseudocode | No | The paper contains architectural diagrams (Figures 2, 3, 4) but no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | More details and samples are in our repository (footnote 7) and demonstration site (footnote 8), with footnote 7: https://github.com/shinhyeokoh/rwen |
| Open Datasets | Yes | Datasets We train and evaluate RWEN on LJSpeech (Ito and Johnson 2017), a single speaker corpus recorded by a female English speaker. |
| Dataset Splits | Yes | It consists of 13,100 short audio clips with a total length of 24 hours, being randomly split into 12,500, 100, and 500 samples to comprise the training, validation, and test datasets as in Kim, Kong, and Son (2021). |
| Hardware Specification | Yes | We use mixed precision training on 16 Tesla A100 GPUs for all the experiments. |
| Software Dependencies | No | We implemented our proposed model, called RWEN, using the PyTorch (Paszke et al. 2019) and Transformers (footnote 6) (Wolf et al. 2020) libraries. This names the libraries but not their specific version numbers. |
| Experiment Setup | Yes | Specifically, Phoneme Encoder and Mel-spectrogram Decoder are composed of four Feed-Forward Transformer (FFT) blocks (Ren et al. 2019) whose parameters are the same as described in Ren et al. (2021) except that the hidden size of the Mel-spectrogram decoder is 1024. Duration Predictor, Pitch Predictor, and Energy Predictor are the same architecture: two 1-D convolutions with kernel size 3 and 256/256 input/output channels, each followed by ReLU, LayerNorm, and Dropout with the probability of 0.1 (see the sketch below the table). ... The batch size is set to 2 per GPU, and the model is trained up to 200k steps. |
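
The predictor description in the Experiment Setup row maps directly onto a small PyTorch module. The sketch below is an illustration under assumptions, not the authors' released code: it stacks two 1-D convolutions (kernel size 3, 256/256 channels), each followed by ReLU, LayerNorm, and Dropout with probability 0.1. The final linear projection to one value per frame and the class name `VariancePredictorSketch` are assumptions in the style of FastSpeech 2 variance predictors and are not stated in the quoted text.

```python
import torch
import torch.nn as nn


class VariancePredictorSketch(nn.Module):
    """Minimal sketch of the Duration/Pitch/Energy predictor layout quoted above."""

    def __init__(self, hidden: int = 256, kernel_size: int = 3, dropout: float = 0.1):
        super().__init__()
        padding = (kernel_size - 1) // 2  # "same" padding for odd kernel sizes
        self.conv1 = nn.Conv1d(hidden, hidden, kernel_size, padding=padding)
        self.norm1 = nn.LayerNorm(hidden)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size, padding=padding)
        self.norm2 = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)
        # Assumed scalar output per frame (FastSpeech 2-style); not in the quote.
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden); Conv1d expects (batch, hidden, time)
        y = self.conv1(x.transpose(1, 2)).transpose(1, 2)
        y = self.dropout(self.norm1(torch.relu(y)))   # conv -> ReLU -> LayerNorm -> Dropout
        y = self.conv2(y.transpose(1, 2)).transpose(1, 2)
        y = self.dropout(self.norm2(torch.relu(y)))
        return self.proj(y).squeeze(-1)               # (batch, time)


# Example: a batch of 2 sequences, 50 frames, hidden size 256
out = VariancePredictorSketch()(torch.randn(2, 50, 256))
print(out.shape)  # torch.Size([2, 50])
```
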