ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech
Authors: Wei Ping, Kainan Peng, Jitong Chen
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present several experiments to evaluate the proposed parallel wave generation method and text-to-wave architecture. We report the mean opinion score (MOS) for naturalness evaluation in Table 1. |
| Researcher Affiliation | Industry | Wei Ping, Kainan Peng, Jitong Chen; {pingwei01, pengkainan, chenjitong01}@baidu.com; Baidu Research, 1195 Bordeaux Dr, Sunnyvale, CA 94089 |
| Pseudocode | Yes | Algorithm 1: Gaussian Inverse Autoregressive Flows as Student Network. (A flow-composition sketch follows the table.) |
| Open Source Code | No | Audio samples are in https://clarinet-demo.github.io/. This link points to audio samples/demos, not the source code for the methodology described in the paper. |
| Open Datasets | No | Data: We use an internal English speech dataset containing about 20 hours of audio from a female speaker with a sampling rate of 48 kHz. We downsample the audios to 24 kHz. (A resampling sketch for this preprocessing step follows the table.) |
| Dataset Splits | No | While the paper mentions 'validation likelihood' in Appendix A and performs evaluation on 'test audios', it does not provide the train/validation/test split percentages, sample counts, or the partitioning methodology needed to reproduce the data splits on the internal dataset. |
| Hardware Specification | Yes | At inference, the parallel student-net runs 20 times faster than real time on NVIDIA GeForce GTX 1080 Ti. (A throughput calculation for this claim follows the table.) |
| Software Dependencies | No | The paper mentions the Adam optimizer (Kingma and Ba, 2015) but does not provide version numbers for any software libraries, frameworks, or languages used. |
| Experiment Setup | Yes | All models share the same architecture except the output distributions, and they are trained for 1000K steps using the Adam optimizer (Kingma and Ba, 2015) with batch-size 8 and 0.5s audio clips. The learning rate is set to 0.001 in the beginning and annealed by half for every 200K steps. We set the dropout probability to 0.05 in all experiments. (A training-configuration sketch follows the table.) |
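The Pseudocode row points to the paper's Algorithm 1, which stacks Gaussian inverse autoregressive flows as the student network. Below is a minimal NumPy sketch of how such a stack composes, not the paper's implementation: `shift_and_scale` is a hypothetical stand-in for the WaveNet-style conditioners, and only the affine composition mirrors the algorithm.

```python
import numpy as np

def shift_and_scale(z_prev, flow_idx):
    """Hypothetical stand-in for the conditioner of flow `flow_idx`.

    In the paper each flow predicts a shift mu_t and scale sigma_t from
    z_prev[:t] (plus mel-spectrogram conditioning); here we return causal
    dummy values of the right shape.
    """
    prefix = np.concatenate(([0.0], np.cumsum(z_prev)[:-1]))  # uses only z_prev[:t]
    mu = 0.1 * np.tanh(prefix)
    sigma = np.full_like(mu, 0.9)  # positive scales
    return mu, sigma

def gaussian_iaf_sample(T, n_flows, rng):
    """Sample from stacked Gaussian IAFs and track the composed mean/scale.

    Each flow applies z <- sigma * z + mu, so starting from z ~ N(0, I) the
    per-sample output distribution stays Gaussian, with
    sigma_tot = prod_i sigma_i and mu_tot = sum_i mu_i * prod_{j>i} sigma_j.
    """
    z = rng.standard_normal(T)         # white-noise input
    mu_tot = np.zeros(T)
    sigma_tot = np.ones(T)
    for i in range(n_flows):
        mu, sigma = shift_and_scale(z, i)
        z = sigma * z + mu             # affine autoregressive transform
        mu_tot = sigma * mu_tot + mu   # compose shifts
        sigma_tot = sigma * sigma_tot  # compose scales
    return z, mu_tot, sigma_tot

x, mu_tot, sigma_tot = gaussian_iaf_sample(T=16, n_flows=4, rng=np.random.default_rng(0))
```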
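The Open Datasets row quotes the preprocessing step that downsamples the internal 48 kHz recordings to 24 kHz. The paper does not say which tool performed the resampling; a minimal sketch assuming librosa and soundfile, with a hypothetical filename, could look like this.

```python
import librosa
import soundfile as sf

# Load a 48 kHz recording and resample to the 24 kHz rate used for training.
# "speaker_utt.wav" is a hypothetical file; the internal dataset is not released.
audio_48k, _ = librosa.load("speaker_utt.wav", sr=48000)
audio_24k = librosa.resample(audio_48k, orig_sr=48000, target_sr=24000)
sf.write("speaker_utt_24k.wav", audio_24k, 24000)
```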
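The Hardware Specification row reports inference running 20 times faster than real time on a GeForce GTX 1080 Ti. At the paper's 24 kHz output rate, that speed-up corresponds to the sample throughput worked out below.

```python
SAMPLE_RATE = 24_000   # Hz, after the paper's downsampling
REAL_TIME_FACTOR = 20  # reported speed-up over real time

samples_per_second = SAMPLE_RATE * REAL_TIME_FACTOR
print(f"~{samples_per_second:,} samples per wall-clock second")
# ~480,000 samples/s, i.e. roughly 20 seconds of audio per second of compute
```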
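The Experiment Setup row gives the optimizer, schedule, batch size, clip length, and dropout, but the paper does not name its training framework. Assuming PyTorch purely for illustration, and using a placeholder `model` and random tensors in place of the real 0.5 s audio clips, the reported settings map onto roughly the following configuration.

```python
import torch

# Placeholder model with the reported dropout of 0.05; the paper's
# teacher/student WaveNets are not reproduced here.
model = torch.nn.Sequential(torch.nn.Dropout(p=0.05), torch.nn.Linear(80, 1))

# Reported setup: Adam, lr 0.001 halved every 200K steps, batch size 8,
# 0.5 s clips at 24 kHz, 1000K total steps.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200_000, gamma=0.5)

BATCH_SIZE = 8
CLIP_SAMPLES = int(0.5 * 24_000)  # 12,000 samples per 0.5 s clip
TOTAL_STEPS = 1_000_000           # "1000K steps"

for step in range(TOTAL_STEPS):   # schematic loop; real training feeds audio clips
    batch = torch.randn(BATCH_SIZE, CLIP_SAMPLES, 80)  # stand-in features
    loss = model(batch).pow(2).mean()                  # stand-in loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()              # halves the learning rate every 200K steps
```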