ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

Authors: Wei Ping, Kainan Peng, Jitong Chen

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we present several experiments to evaluate the proposed parallel wave generation method and text-to-wave architecture. We report the mean opinion score (MOS) for naturalness evaluation in Table 1."
Researcher Affiliation | Industry | "Wei Ping, Kainan Peng, Jitong Chen {pingwei01, pengkainan, chenjitong01}@baidu.com, Baidu Research, 1195 Bordeaux Dr, Sunnyvale, CA 94089"
Pseudocode | Yes | "Algorithm 1: Gaussian Inverse Autoregressive Flows as Student Network" (an illustrative sketch of this sampling procedure follows the table).
Open Source Code | No | "Audio samples are in https://clarinet-demo.github.io/." This link points to audio samples/demos, not the source code for the methodology described in the paper.
Open Datasets | No | "Data: We use an internal English speech dataset containing about 20 hours of audio from a female speaker with a sampling rate of 48 kHz. We downsample the audios to 24 kHz." (A resampling sketch follows the table.)
Dataset Splits | No | The paper mentions a 'validation likelihood' in Appendix A and evaluates on 'test audios', but it does not give train/validation/test split percentages, sample counts, or the methodology for partitioning the internal dataset, so the data splits cannot be reproduced.
Hardware Specification | Yes | "At inference, the parallel student-net runs 20 times faster than real time on NVIDIA GeForce GTX 1080 Ti."
Software Dependencies | No | "All models share the same architecture except the output distributions, and they are trained for 1000K steps using the Adam optimizer (Kingma and Ba, 2015) with batch-size 8 and 0.5s audio clips." The paper mentions the Adam optimizer but does not provide version numbers for any software libraries, frameworks, or languages used.
Experiment Setup | Yes | "All models share the same architecture except the output distributions, and they are trained for 1000K steps using the Adam optimizer (Kingma and Ba, 2015) with batch-size 8 and 0.5s audio clips. The learning rate is set to 0.001 in the beginning and annealed by half for every 200K steps. We set the dropout probability to 0.05 in all experiments." (A training-configuration sketch follows the table.)
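
For context on the Pseudocode row: the core operation of Algorithm 1 is the per-timestep affine transform applied by each Gaussian inverse autoregressive flow, which is what allows all timesteps to be sampled in parallel. The sketch below is a minimal PyTorch illustration under our own assumptions, not the authors' implementation; the `flows` callables, their `(mu, log_sigma)` outputs, and the `condition` argument are hypothetical interfaces.

```python
import torch

def gaussian_iaf_sample(flows, z, condition):
    """Parallel sampling through a stack of Gaussian inverse autoregressive flows.

    `flows` is assumed to be a list of networks, each returning per-timestep
    Gaussian parameters (mu, log_sigma) for its input; this interface is
    illustrative, not the paper's code.
    """
    x = z                                # white noise z ~ N(0, I), shape (batch, 1, time)
    mu_tot = torch.zeros_like(z)         # mean of the composed output Gaussian
    log_sigma_tot = torch.zeros_like(z)  # log-scale of the composed output Gaussian
    for flow in flows:
        mu, log_sigma = flow(x, condition)   # all timesteps computed in one pass
        x = x * log_sigma.exp() + mu         # affine transform: x_t <- x_t * sigma_t + mu_t
        # composing affine flows keeps the per-timestep distribution Gaussian:
        mu_tot = mu_tot * log_sigma.exp() + mu
        log_sigma_tot = log_sigma_tot + log_sigma
    return x, mu_tot, log_sigma_tot

# Toy usage with constant (zero) parameters, just to exercise the function:
toy_flows = [lambda x, c: (torch.zeros_like(x), torch.zeros_like(x)) for _ in range(4)]
audio, mu, log_sigma = gaussian_iaf_sample(toy_flows, torch.randn(2, 1, 12000), condition=None)
```

Because each flow only rescales and shifts its input, the composed per-timestep distribution stays Gaussian, which is what makes the closed-form distillation objective in the paper possible.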
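The Open Datasets row quotes the 48 kHz to 24 kHz downsampling step. Since the internal dataset is not released, the snippet below only illustrates that preprocessing on a placeholder file; the path and the choice of torchaudio are assumptions, as the paper does not state which tools were used.

```python
import torchaudio

# Placeholder path: the internal 20-hour dataset is not public, so this only demonstrates
# the 48 kHz -> 24 kHz downsampling step described in the quoted text.
waveform, sample_rate = torchaudio.load("utterance_48khz.wav")
waveform_24k = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=24_000)
torchaudio.save("utterance_24khz.wav", waveform_24k, sample_rate=24_000)
```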
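The Experiment Setup row contains enough detail to reconstruct the optimization schedule. The sketch below maps those numbers onto a generic PyTorch training loop; the model and loss are throwaway stand-ins (the paper's networks are WaveNet-style teacher/student models), and only the optimizer, learning-rate annealing, batch size, clip length, step count, and dropout value come from the quoted text.

```python
import torch

# Stand-in model: any nn.Module with the reported dropout of 0.05 serves to illustrate the setup.
model = torch.nn.Sequential(
    torch.nn.Conv1d(1, 64, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.05),
    torch.nn.Conv1d(64, 1, kernel_size=3, padding=1),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # initial learning rate 0.001
# "annealed by half for every 200K steps":
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200_000, gamma=0.5)

batch_size = 8                      # "batch-size 8"
clip_samples = int(0.5 * 24_000)    # "0.5s audio clips" at the 24 kHz sampling rate

for step in range(1_000_000):       # "trained for 1000K steps"
    clips = torch.randn(batch_size, 1, clip_samples)   # placeholder batch of audio clips
    loss = model(clips).pow(2).mean()                  # placeholder loss, not the paper's objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```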