WaveFlow: A Compact Flow-based Model for Raw Audio
Authors: Wei Ping, Kainan Peng, Kexin Zhao, Zhao Song
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we compare likelihood-based generative models for raw audio in terms of test likelihood, synthesis speed, and speech fidelity. The results in this section are obtained from an internal PyTorch implementation. We use the LJ speech dataset (Ito, 2017) containing about 24 hours of audio with a sampling rate of 22.05 kHz, recorded on a MacBook Pro in a home environment. |
| Researcher Affiliation | Industry | Baidu Research, 1195 Bordeaux Dr, Sunnyvale, CA. |
| Pseudocode | No | No pseudocode or algorithm block was found in the paper. |
| Open Source Code | Yes | Project page: https://waveflow-demo.github.io/. Correspondence to: Wei Ping <weiping.thu@gmail.com>. We provide a PaddlePaddle reimplementation in the Parakeet toolkit: https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/waveflow |
| Open Datasets | Yes | Data: We use the LJ speech dataset (Ito, 2017) containing about 24 hours of audio with a sampling rate of 22.05 kHz, recorded on a MacBook Pro in a home environment. It consists of 13,100 audio clips from a single female speaker. |
| Dataset Splits | No | Training: We train all models on 8 Nvidia 1080Ti GPUs using randomly chosen short clips of 16,000 samples from each utterance. For WaveFlow and WaveNet, we use the Adam optimizer (Kingma & Ba, 2015) with a batch size of 8 and a constant learning rate of 2 × 10⁻⁴. For WaveGlow, we use the Adam optimizer with a batch size of 16 and a learning rate of 1 × 10⁻⁴. We apply weight normalization (Salimans & Kingma, 2016) whenever possible. From Section 5.1 (Likelihood): We evaluate the test log-likelihoods (LLs) of WaveFlow, WaveNet, WaveGlow, and autoregressive flow (AF) conditioned on mel spectrograms at 1M training steps. (A hedged sketch of this training configuration follows the table.) |
| Hardware Specification | Yes | It can generate 22.05 kHz high-fidelity audio 42.6× faster than real time (at a rate of 939.3 kHz) on a V100 GPU without engineered inference kernels. We train all models on 8 Nvidia 1080Ti GPUs using randomly chosen short clips of 16,000 samples from each utterance. (The implied real-time factor is worked out after the table.) |
| Software Dependencies | No | The results in this section are obtained from an internal PyTorch implementation. We provide a PaddlePaddle reimplementation in the Parakeet toolkit. We run synthesis under NVIDIA Apex with 16-bit floating point (FP16) arithmetic. No specific version numbers for PyTorch, PaddlePaddle, or NVIDIA Apex are provided. |
| Experiment Setup | Yes | For WaveFlow and WaveNet, we use the Adam optimizer (Kingma & Ba, 2015) with a batch size of 8 and a constant learning rate of 2 × 10⁻⁴. For WaveGlow, we use the Adam optimizer with a batch size of 16 and a learning rate of 1 × 10⁻⁴. We set FFT size to 1024, hop size to 256, and window size to 1024. (A feature-extraction sketch using these STFT settings follows the table.) |
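
The training configuration quoted in the Dataset Splits and Experiment Setup rows maps onto a few lines of PyTorch. The sketch below is a minimal illustration, not the authors' code: the two-layer convolutional model is a hypothetical placeholder standing in for the real WaveFlow network, and only the optimizer settings, clip length, and use of weight normalization come from the table.

```python
import torch
from torch.nn.utils import weight_norm

CLIP_SAMPLES = 16_000  # training uses random clips of 16,000 samples per utterance

def random_clip(waveform: torch.Tensor, clip_len: int = CLIP_SAMPLES) -> torch.Tensor:
    """Randomly crop a 1-D waveform to clip_len samples (zero-pad if shorter)."""
    n = waveform.numel()
    if n <= clip_len:
        return torch.nn.functional.pad(waveform, (0, clip_len - n))
    start = torch.randint(0, n - clip_len + 1, (1,)).item()
    return waveform[start:start + clip_len]

# Hypothetical placeholder model, NOT the WaveFlow architecture; it exists only
# so the optimizer setup below is runnable. Weight normalization is applied
# "whenever possible", per the paper.
model = torch.nn.Sequential(
    weight_norm(torch.nn.Conv1d(1, 64, kernel_size=3, padding=1)),
    torch.nn.ReLU(),
    weight_norm(torch.nn.Conv1d(64, 1, kernel_size=3, padding=1)),
)

# WaveFlow / WaveNet: Adam with batch size 8 and a constant learning rate of 2e-4.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
BATCH_SIZE = 8  # the WaveGlow baseline instead used batch size 16 and lr 1e-4
```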
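The 42.6× figure in the Hardware Specification row is simply the ratio of the reported synthesis rate to the audio sampling rate, which is easy to verify:

```python
# Real-time factor implied by the numbers in the Hardware Specification row:
# samples generated per second divided by samples played back per second.
synthesis_rate_hz = 939_300  # reported synthesis speed on a V100 GPU
sample_rate_hz = 22_050      # LJSpeech sampling rate
print(f"{synthesis_rate_hz / sample_rate_hz:.1f}x real time")  # -> 42.6x real time
```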
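The STFT parameters in the Experiment Setup row (FFT size 1024, hop size 256, window size 1024 at 22.05 kHz) are enough to reproduce the mel-spectrogram conditioner. The sketch below uses torchaudio rather than the authors' code, and the 80 mel bands are an assumption: that count is the usual choice for neural vocoders but is not stated in the quoted setup.

```python
import torch
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22_050,  # LJSpeech sampling rate
    n_fft=1024,          # FFT size from the table
    win_length=1024,     # window size from the table
    hop_length=256,      # hop size from the table
    n_mels=80,           # ASSUMPTION: 80 mel bands, not stated in the quoted setup
)

waveform = torch.randn(1, 22_050)  # stand-in for one second of audio
mel = mel_transform(waveform)      # shape: (channels, n_mels, frames)
print(mel.shape)                   # -> torch.Size([1, 80, 87])
```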