ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

Authors: Wei Ping, Kainan Peng, Jitong Chen

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we present several experiments to evaluate the proposed parallel wave generation method and text-to-wave architecture. We report the mean opinion score (MOS) for naturalness evaluation in Table 1."
Researcher Affiliation | Industry | "Wei Ping, Kainan Peng, Jitong Chen {pingwei01, pengkainan, chenjitong01}@baidu.com, Baidu Research, 1195 Bordeaux Dr, Sunnyvale, CA 94089"
Pseudocode | Yes | "Algorithm 1: Gaussian Inverse Autoregressive Flows as Student Network" (an illustrative sketch of this sampling procedure follows the table).
Open Source Code | No | "Audio samples are in https://clarinet-demo.github.io/." This link points to audio samples/demos, not the source code for the methodology described in the paper.
Open Datasets | No | "Data: We use an internal English speech dataset containing about 20 hours of audio from a female speaker with a sampling rate of 48 kHz. We downsample the audios to 24 kHz." (A resampling sketch follows the table.)
Dataset Splits | No | The paper mentions a 'validation likelihood' in Appendix A and evaluates on 'test audios', but it does not give train/validation/test split percentages, sample counts, or the methodology for partitioning the internal dataset, so the data splits cannot be reproduced.
Hardware Specification | Yes | "At inference, the parallel student-net runs 20 times faster than real time on NVIDIA GeForce GTX 1080 Ti."
Software Dependencies | No | "All models share the same architecture except the output distributions, and they are trained for 1000K steps using the Adam optimizer (Kingma and Ba, 2015) with batch-size 8 and 0.5s audio clips." The paper mentions the Adam optimizer but does not provide version numbers for any software libraries, frameworks, or languages used.
Experiment Setup | Yes | "All models share the same architecture except the output distributions, and they are trained for 1000K steps using the Adam optimizer (Kingma and Ba, 2015) with batch-size 8 and 0.5s audio clips. The learning rate is set to 0.001 in the beginning and annealed by half for every 200K steps. We set the dropout probability to 0.05 in all experiments." (A training-configuration sketch follows the table.)
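
For context on the Pseudocode row: the core operation of Algorithm 1 is the per-timestep affine transform applied by each Gaussian inverse autoregressive flow, which is what allows all timesteps to be sampled in parallel. The sketch below is a minimal PyTorch illustration under our own assumptions, not the authors' implementation; the `flows` callables, their `(mu, log_sigma)` outputs, and the `condition` argument are hypothetical interfaces.

```python
import torch

def gaussian_iaf_sample(flows, z, condition):
    """Parallel sampling through a stack of Gaussian inverse autoregressive flows.

    `flows` is assumed to be a list of networks, each returning per-timestep
    Gaussian parameters (mu, log_sigma) for its input; this interface is
    illustrative, not the paper's code.
    """
    x = z                                # white noise z ~ N(0, I), shape (batch, 1, time)
    mu_tot = torch.zeros_like(z)         # mean of the composed output Gaussian
    log_sigma_tot = torch.zeros_like(z)  # log-scale of the composed output Gaussian
    for flow in flows:
        mu, log_sigma = flow(x, condition)   # all timesteps computed in one pass
        x = x * log_sigma.exp() + mu         # affine transform: x_t <- x_t * sigma_t + mu_t
        # composing affine flows keeps the per-timestep distribution Gaussian:
        mu_tot = mu_tot * log_sigma.exp() + mu
        log_sigma_tot = log_sigma_tot + log_sigma
    return x, mu_tot, log_sigma_tot

# Toy usage with constant (zero) parameters, just to exercise the function:
toy_flows = [lambda x, c: (torch.zeros_like(x), torch.zeros_like(x)) for _ in range(4)]
audio, mu, log_sigma = gaussian_iaf_sample(toy_flows, torch.randn(2, 1, 12000), condition=None)
```

Because each flow only rescales and shifts its input, the composed per-timestep distribution stays Gaussian, which is what makes the closed-form distillation objective in the paper possible.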
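The Open Datasets row quotes the 48 kHz to 24 kHz downsampling step. Since the internal dataset is not released, the snippet below only illustrates that preprocessing on a placeholder file; the path and the choice of torchaudio are assumptions, as the paper does not state which tools were used.

```python
import torchaudio

# Placeholder path: the internal 20-hour dataset is not public, so this only demonstrates
# the 48 kHz -> 24 kHz downsampling step described in the quoted text.
waveform, sample_rate = torchaudio.load("utterance_48khz.wav")
waveform_24k = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=24_000)
torchaudio.save("utterance_24khz.wav", waveform_24k, sample_rate=24_000)
```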
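The Experiment Setup row contains enough detail to reconstruct the optimization schedule. The sketch below maps those numbers onto a generic PyTorch training loop; the model and loss are throwaway stand-ins (the paper's networks are WaveNet-style teacher/student models), and only the optimizer, learning-rate annealing, batch size, clip length, step count, and dropout value come from the quoted text.

```python
import torch

# Stand-in model: any nn.Module with the reported dropout of 0.05 serves to illustrate the setup.
model = torch.nn.Sequential(
    torch.nn.Conv1d(1, 64, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.05),
    torch.nn.Conv1d(64, 1, kernel_size=3, padding=1),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # initial learning rate 0.001
# "annealed by half for every 200K steps":
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200_000, gamma=0.5)

batch_size = 8                      # "batch-size 8"
clip_samples = int(0.5 * 24_000)    # "0.5s audio clips" at the 24 kHz sampling rate

for step in range(1_000_000):       # "trained for 1000K steps"
    clips = torch.randn(batch_size, 1, clip_samples)   # placeholder batch of audio clips
    loss = model(clips).pow(2).mean()                  # placeholder loss, not the paper's objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```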