WaveGrad: Estimating Gradients for Waveform Generation

Authors: Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, William Chan

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments reveal WaveGrad to generate high-fidelity audio, outperforming adversarial non-autoregressive baselines and matching a strong likelihood-based autoregressive baseline while using fewer sequential operations.
Researcher Affiliation | Collaboration | Nanxin Chen (Johns Hopkins University, Center for Language and Speech Processing; bobchennan@jhu.edu); Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, William Chan (Google Research, Brain Team; {ngyuzh,heigazen,ronw,mnorouzi,williamchan}@google.com)
Pseudocode | Yes | Algorithm 1 (Training): WaveGrad directly conditions on the continuous noise level √ᾱ. ... Algorithm 2 (Sampling): WaveGrad generates samples following a gradient-based sampler similar to Langevin dynamics. (A hedged sketch of both loops appears after this table.)
Open Source Code | No | The paper links to audio samples (https://wavegrad.github.io/) and to a public baseline implementation (https://github.com/kan-bayashi/ParallelWaveGAN), but does not state that WaveGrad's own source code is released or provide a link to it.
Open Datasets | Yes | We ran experiments using the LJ Speech dataset (Ito & Johnson, 2017), a publicly available dataset consisting of audiobook recordings that were segmented into utterances of up to 10 seconds.
Dataset Splits | Yes | We used a validation set of 50 utterances for objective evaluation, including audio samples from multiple speakers.
Hardware Specification | Yes | Models were trained using 32 Tensor Processing Unit (TPU) v2 cores. ... trained the model using 128 TPU v3 cores. ... achieved a real-time factor (RTF) of 0.2 on an NVIDIA V100 GPU ... and RTF = 1.5 on an Intel Xeon CPU (16 cores, 2.3 GHz).
Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x).
Experiment Setup | Yes | The WaveGrad Base model took 24 frames, corresponding to 0.3 seconds of audio (7,200 samples), as input during training. We set the batch size to 256. Models were trained using 32 Tensor Processing Unit (TPU) v2 cores. The WaveGrad Base model contained 15M parameters. For the WaveGrad Large model, ... each training sample included 60 frames, corresponding to 0.75 seconds of audio (18,000 samples). We used the same batch size and trained the model using 128 TPU v3 cores. The WaveGrad Large model contained 23M parameters. Both Base and Large models were trained for about 1M steps. (These hyperparameters are gathered into an illustrative config sketch below.)
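
The two algorithms summarized in the Pseudocode row are compact enough to sketch. Below is a minimal, hedged paraphrase of Algorithm 1 (training on a continuous noise level) and Algorithm 2 (iterative denoising) in PyTorch; the `model(y_noisy, mel, noise_level)` interface, the linear beta schedule, and all hyperparameter values are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of WaveGrad's Algorithm 1 (training) and Algorithm 2
# (sampling). The model interface `model(y_noisy, mel, noise_level)`, the
# linear beta schedule, and the hyperparameter values are illustrative
# assumptions, not the authors' released code.
import torch


def make_schedule(num_steps: int = 50, beta_start: float = 1e-4, beta_end: float = 0.05):
    """Assumed linear noise schedule: betas -> alphas -> cumulative products."""
    betas = torch.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)  # \bar{alpha}_n, decreasing in n
    return betas, alphas, alpha_bars


def training_step(model, y0, mel, alpha_bars):
    """Algorithm 1: corrupt clean audio y0 and regress the injected noise,
    conditioning on a *continuous* level sqrt(alpha_bar) drawn uniformly
    between two adjacent discrete levels."""
    batch = y0.shape[0]
    n = torch.randint(1, len(alpha_bars), (batch,))
    lo, hi = alpha_bars[n].sqrt(), alpha_bars[n - 1].sqrt()
    sqrt_alpha_bar = (lo + (hi - lo) * torch.rand(batch)).unsqueeze(-1)  # (B, 1)
    eps = torch.randn_like(y0)
    y_noisy = sqrt_alpha_bar * y0 + (1.0 - sqrt_alpha_bar**2).sqrt() * eps
    eps_hat = model(y_noisy, mel, sqrt_alpha_bar.squeeze(-1))
    return (eps_hat - eps).abs().mean()  # L1 loss, as in the paper


@torch.no_grad()
def sample(model, mel, betas, alphas, alpha_bars, num_samples, length):
    """Algorithm 2: start from Gaussian noise and iteratively denoise,
    a gradient-based sampler similar to Langevin dynamics."""
    y = torch.randn(num_samples, length)
    for n in reversed(range(len(betas))):
        level = alpha_bars[n].sqrt().expand(num_samples)
        eps_hat = model(y, mel, level)
        y = (y - (1.0 - alphas[n]) / (1.0 - alpha_bars[n]).sqrt() * eps_hat) / alphas[n].sqrt()
        if n > 0:  # inject fresh noise on all but the final step
            sigma = ((1.0 - alpha_bars[n - 1]) / (1.0 - alpha_bars[n]) * betas[n]).sqrt()
            y = y + sigma * torch.randn_like(y)
    return y
```

Conditioning on the continuous level √ᾱ, rather than a discrete step index, is what lets WaveGrad trade quality against the number of refinement iterations at inference time without retraining.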
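
For quick reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single config. The field names below are hypothetical; the values are those reported in the paper.

```python
# Hyperparameters quoted in the Experiment Setup row, gathered into one
# illustrative config. Field names are hypothetical; values are the paper's.
WAVEGRAD_CONFIGS = {
    "base": {
        "frames_per_sample": 24,     # 0.3 s of audio per training example
        "samples_per_clip": 7_200,   # implies a 24 kHz sample rate (7,200 / 0.3 s)
        "batch_size": 256,
        "num_parameters": "15M",
        "training_hardware": "32x TPU v2 cores",
    },
    "large": {
        "frames_per_sample": 60,     # 0.75 s of audio per training example
        "samples_per_clip": 18_000,
        "batch_size": 256,           # "the same batch size" as Base
        "num_parameters": "23M",
        "training_hardware": "128x TPU v3 cores",
    },
    "training_steps": 1_000_000,     # both models trained for about 1M steps
}
```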