WaveFlow: A Compact Flow-based Model for Raw Audio

Authors: Wei Ping, Kainan Peng, Kexin Zhao, Zhao Song

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we compare likelihood-based generative models for raw audio in terms of test likelihood, synthesis speed, and speech fidelity. The results in this section are obtained from an internal PyTorch implementation. We use the LJ Speech dataset (Ito, 2017) containing about 24 hours of audio with a sampling rate of 22.05 kHz recorded on a MacBook Pro in a home environment.
Researcher Affiliation | Industry | Baidu Research, 1195 Bordeaux Dr, Sunnyvale, CA.
Pseudocode | No | No pseudocode or algorithm block was found in the paper.
Open Source Code | Yes | Project page: https://waveflow-demo.github.io/. Correspondence to: Wei Ping <weiping.thu@gmail.com>. We provide a PaddlePaddle reimplementation in the Parakeet toolkit: https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/waveflow
Open Datasets | Yes | Data: We use the LJ Speech dataset (Ito, 2017) containing about 24 hours of audio with a sampling rate of 22.05 kHz recorded on a MacBook Pro in a home environment. It consists of 13,100 audio clips from a single female speaker.
Dataset Splits | No | Training: We train all models on 8 Nvidia 1080Ti GPUs using randomly chosen short clips of 16,000 samples from each utterance. For WaveFlow and WaveNet, we use the Adam optimizer (Kingma & Ba, 2015) with a batch size of 8 and a constant learning rate of 2 × 10⁻⁴. For WaveGlow, we use the Adam optimizer with a batch size of 16 and a learning rate of 1 × 10⁻⁴. We apply weight normalization (Salimans & Kingma, 2016) whenever possible. (Section 5.1, Likelihood:) We evaluate the test log-likelihoods (LLs) of WaveFlow, WaveNet, WaveGlow, and autoregressive flow (AF) conditioned on mel spectrograms at 1M training steps. A hedged PyTorch sketch of this training configuration appears after the table.
Hardware Specification | Yes | It can generate 22.05 kHz high-fidelity audio 42.6× faster than real-time (at a rate of 939.3 kHz) on a V100 GPU without engineered inference kernels. We train all models on 8 Nvidia 1080Ti GPUs using randomly chosen short clips of 16,000 samples from each utterance.
Software Dependencies | No | The results in this section are obtained from an internal PyTorch implementation. We provide a PaddlePaddle reimplementation in the Parakeet toolkit. We run synthesis under NVIDIA Apex with 16-bit floating point (FP16) arithmetic (see the FP16 synthesis sketch after the table). No specific version numbers for PyTorch, PaddlePaddle, or NVIDIA Apex are provided.
Experiment Setup | Yes | For WaveFlow and WaveNet, we use the Adam optimizer (Kingma & Ba, 2015) with a batch size of 8 and a constant learning rate of 2 × 10⁻⁴. For WaveGlow, we use the Adam optimizer with a batch size of 16 and a learning rate of 1 × 10⁻⁴. We set FFT size to 1024, hop size to 256, and window size to 1024 (a mel-spectrogram sketch with these parameters appears after the table).
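
The training configuration quoted in the Dataset Splits and Experiment Setup rows (Adam with a constant learning rate of 2 × 10⁻⁴ and batch size 8 for WaveFlow/WaveNet, weight normalization wherever possible, and randomly chosen 16,000-sample clips from each utterance) can be approximated with the PyTorch sketch below. The stand-in model, the clip-sampling helper, and all variable names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

# Stand-in model; the actual WaveFlow architecture is not reproduced here.
model = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(64, 1, kernel_size=3, padding=1),
)

# Apply weight normalization "whenever possible" (Salimans & Kingma, 2016).
for module in model.modules():
    if isinstance(module, nn.Conv1d):
        nn.utils.weight_norm(module)

# Adam with a constant learning rate of 2e-4 (WaveFlow / WaveNet setting, batch size 8).
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

def random_clip(waveform: torch.Tensor, clip_len: int = 16_000) -> torch.Tensor:
    """Randomly chosen short clip of 16,000 samples from one utterance."""
    start = torch.randint(0, waveform.size(-1) - clip_len + 1, (1,)).item()
    return waveform[..., start:start + clip_len]
```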
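
The mel-spectrogram conditioning parameters in the Experiment Setup row (FFT size 1024, hop size 256, window size 1024, at LJ Speech's 22.05 kHz sampling rate) map directly onto a torchaudio transform, as sketched below. The choice of 80 mel bands and the log compression step are assumptions; the excerpt does not specify them.

```python
import torch
import torchaudio

# STFT parameters from the paper: FFT size 1024, hop size 256, window size 1024.
mel_extractor = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050,   # LJ Speech sampling rate
    n_fft=1024,
    win_length=1024,
    hop_length=256,
    n_mels=80,           # assumption: the number of mel bands is not stated in the excerpt
)

waveform = torch.randn(1, 22050)                  # one second of dummy audio
mel = mel_extractor(waveform)                     # shape: (1, 80, num_frames)
log_mel = torch.log(torch.clamp(mel, min=1e-5))   # log compression (a common choice)
```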
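
The Software Dependencies row notes that synthesis was run under NVIDIA Apex with FP16 arithmetic. A minimal, Apex-free way to approximate FP16 inference in plain PyTorch is to cast the model and conditioning input to half precision, as in the sketch below; the forward interface (mel in, waveform out) is a hypothetical placeholder, not the authors' synthesis code.

```python
import torch

@torch.no_grad()
def synthesize_fp16(model: torch.nn.Module, mel: torch.Tensor) -> torch.Tensor:
    """Run synthesis with 16-bit floats on the GPU (approximating the Apex FP16 setup)."""
    model = model.half().cuda().eval()
    audio = model(mel.half().cuda())   # hypothetical interface: mel spectrogram -> waveform
    return audio.float().cpu()
```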