WaveFlow: A Compact Flow-based Model for Raw Audio

Authors: Wei Ping, Kainan Peng, Kexin Zhao, Zhao Song

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we compare likelihood-based generative models for raw audio in terms of test likelihood, synthesis speed, and speech fidelity. The results in this section are obtained from an internal PyTorch implementation. We use the LJ Speech dataset (Ito, 2017) containing about 24 hours of audio with a sampling rate of 22.05 kHz recorded on a MacBook Pro in a home environment.
Researcher Affiliation | Industry | Baidu Research, 1195 Bordeaux Dr, Sunnyvale, CA.
Pseudocode | No | No pseudocode or algorithm block was found in the paper.
Open Source Code | Yes | Project page: https://waveflow-demo.github.io/. Correspondence to: Wei Ping <weiping.thu@gmail.com>. We provide a PaddlePaddle reimplementation in the Parakeet toolkit: https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/waveflow
Open Datasets | Yes | Data: We use the LJ Speech dataset (Ito, 2017) containing about 24 hours of audio with a sampling rate of 22.05 kHz recorded on a MacBook Pro in a home environment. It consists of 13,100 audio clips from a single female speaker.
Dataset Splits | No | Training: We train all models on 8 Nvidia 1080Ti GPUs using randomly chosen short clips of 16,000 samples from each utterance. For WaveFlow and WaveNet, we use the Adam optimizer (Kingma & Ba, 2015) with a batch size of 8 and a constant learning rate of 2 × 10⁻⁴. For WaveGlow, we use the Adam optimizer with a batch size of 16 and a learning rate of 1 × 10⁻⁴. We apply weight normalization (Salimans & Kingma, 2016) whenever possible. (Section 5.1, Likelihood:) We evaluate the test log-likelihoods (LLs) of WaveFlow, WaveNet, WaveGlow, and autoregressive flow (AF) conditioned on mel spectrograms at 1M training steps. A hedged PyTorch sketch of this training configuration appears after the table.
Hardware Specification | Yes | It can generate 22.05 kHz high-fidelity audio 42.6× faster than real-time (at a rate of 939.3 kHz) on a V100 GPU without engineered inference kernels. We train all models on 8 Nvidia 1080Ti GPUs using randomly chosen short clips of 16,000 samples from each utterance.
Software Dependencies | No | The results in this section are obtained from an internal PyTorch implementation. We provide a PaddlePaddle reimplementation in the Parakeet toolkit. We run synthesis under NVIDIA Apex with 16-bit floating point (FP16) arithmetic (see the FP16 synthesis sketch after the table). No specific version numbers for PyTorch, PaddlePaddle, or NVIDIA Apex are provided.
Experiment Setup | Yes | For WaveFlow and WaveNet, we use the Adam optimizer (Kingma & Ba, 2015) with a batch size of 8 and a constant learning rate of 2 × 10⁻⁴. For WaveGlow, we use the Adam optimizer with a batch size of 16 and a learning rate of 1 × 10⁻⁴. We set FFT size to 1024, hop size to 256, and window size to 1024 (a mel-spectrogram sketch with these parameters appears after the table).
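
The training configuration quoted in the Dataset Splits and Experiment Setup rows (Adam with a constant learning rate of 2 × 10⁻⁴ and batch size 8 for WaveFlow/WaveNet, weight normalization wherever possible, and randomly chosen 16,000-sample clips from each utterance) can be approximated with the PyTorch sketch below. The stand-in model, the clip-sampling helper, and all variable names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

# Stand-in model; the actual WaveFlow architecture is not reproduced here.
model = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(64, 1, kernel_size=3, padding=1),
)

# Apply weight normalization "whenever possible" (Salimans & Kingma, 2016).
for module in model.modules():
    if isinstance(module, nn.Conv1d):
        nn.utils.weight_norm(module)

# Adam with a constant learning rate of 2e-4 (WaveFlow / WaveNet setting, batch size 8).
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

def random_clip(waveform: torch.Tensor, clip_len: int = 16_000) -> torch.Tensor:
    """Randomly chosen short clip of 16,000 samples from one utterance."""
    start = torch.randint(0, waveform.size(-1) - clip_len + 1, (1,)).item()
    return waveform[..., start:start + clip_len]
```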
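
The mel-spectrogram conditioning parameters in the Experiment Setup row (FFT size 1024, hop size 256, window size 1024, at LJ Speech's 22.05 kHz sampling rate) map directly onto a torchaudio transform, as sketched below. The choice of 80 mel bands and the log compression step are assumptions; the excerpt does not specify them.

```python
import torch
import torchaudio

# STFT parameters from the paper: FFT size 1024, hop size 256, window size 1024.
mel_extractor = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050,   # LJ Speech sampling rate
    n_fft=1024,
    win_length=1024,
    hop_length=256,
    n_mels=80,           # assumption: the number of mel bands is not stated in the excerpt
)

waveform = torch.randn(1, 22050)                  # one second of dummy audio
mel = mel_extractor(waveform)                     # shape: (1, 80, num_frames)
log_mel = torch.log(torch.clamp(mel, min=1e-5))   # log compression (a common choice)
```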
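
The Software Dependencies row notes that synthesis was run under NVIDIA Apex with FP16 arithmetic. A minimal, Apex-free way to approximate FP16 inference in plain PyTorch is to cast the model and conditioning input to half precision, as in the sketch below; the forward interface (mel in, waveform out) is a hypothetical placeholder, not the authors' synthesis code.

```python
import torch

@torch.no_grad()
def synthesize_fp16(model: torch.nn.Module, mel: torch.Tensor) -> torch.Tensor:
    """Run synthesis with 16-bit floats on the GPU (approximating the Apex FP16 setup)."""
    model = model.half().cuda().eval()
    audio = model(mel.half().cuda())   # hypothetical interface: mel spectrogram -> waveform
    return audio.float().cpu()
```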