PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior

Authors: Sang-gil Lee, Heeseung Kim, Chaehun Shin, Xu Tan, Chang Liu, Qi Meng, Tao Qin, Wei Chen, Sungroh Yoon, Tie-Yan Liu

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implemented PriorGrad based on the recently proposed diffusion-based speech generative models (Kong et al., 2021; Chen et al., 2021; Jeong et al., 2021), and conducted experiments on the LJSpeech (Ito & Johnson, 2017) dataset. The experimental results demonstrate the benefits of PriorGrad, such as significantly faster model convergence during training, improved perceptual quality, and improved tolerance to a reduction in network capacity.
Researcher Affiliation | Collaboration | Sang-gil Lee (1), Heeseung Kim (1), Chaehun Shin (1), Xu Tan (2), Chang Liu (2), Qi Meng (2), Tao Qin (2), Wei Chen (2), Sungroh Yoon (1,3), Tie-Yan Liu (2). (1) Data Science & AI Lab., Seoul National University; (2) Microsoft Research Asia; (3) AIIS, ASRI, INMC, ISRC, NSI, and Interdisciplinary Program in Artificial Intelligence, Seoul National University.
Pseudocode | Yes | Algorithms 1 and 2 describe the training and sampling procedures augmented by the data-dependent prior (µ, Σ). A paraphrased sketch of these two procedures is given after the table.
Open Source Code | No | We followed the publicly available implementation (footnote 3), which uses a 2.62M parameter model... Footnote 3 points to https://github.com/lmnt-com/diffwave, which is the baseline implementation (DiffWave), not explicitly the open-source code for the authors' proposed method (PriorGrad).
Open Datasets | Yes | We used the LJSpeech (Ito & Johnson, 2017) dataset for all experiments, which is a commonly used open-source 24-hour speech dataset with 13,100 audio clips from a single female speaker.
Dataset Splits | Yes | We used 13,000 clips as the training set, 5 clips as the validation set, and the remaining 95 clips as the test set for objective and subjective audio quality evaluation. An illustrative split sketch appears after the table.
Hardware Specification | Yes | Training for 1M iterations took approximately 7 days with a single NVIDIA A40 GPU. ... Training for 300K iterations took approximately 2 days on a single NVIDIA P100 GPU.
Software Dependencies | No | The paper mentions specific tools and libraries such as the Adam optimizer, Parallel WaveGAN, HiFi-GAN, the MFA toolkit, and SWIPE, and links to some open-source libraries, but it does not provide version numbers for these software dependencies (e.g., the PyTorch version or the versions of libraries such as auraloss).
Experiment Setup | Yes | We used the publicly available implementation (footnote 3), which uses a 2.62M parameter model with the Adam optimizer (Kingma & Ba, 2014) and a learning rate of 2 × 10⁻⁴ for a total of 1M iterations. ... We used the default T = 50 diffusion steps and the linear beta schedule ranging from 1 × 10⁻⁴ to 5 × 10⁻² for training and inference... We also used the fast T_infer = 6 inference noise schedule... We conducted a comparative study of the PriorGrad acoustic model with different diffusion decoder network capacities, i.e., a small model with 3.5M parameters (128 residual channels) and a large model with 10M parameters (256 residual channels). A sketch of this schedule and optimizer configuration also appears after the table.
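For the Pseudocode row: the paper's Algorithms 1 and 2 are not reproduced verbatim here. The block below is only a minimal PyTorch-style sketch of training and sampling with a data-dependent prior N(µ, Σ), paraphrased from the paper's description; the names (eps_model, mu, sigma) are placeholders of ours, the network is assumed to operate in the mean-shifted space, and the reverse-process variance follows the standard DDPM choice as an assumption.

```python
# Sketch of PriorGrad-style training/sampling with a data-dependent prior
# N(mu, Sigma) instead of N(0, I). Names are placeholders, not the authors' code.
import torch

def training_step(eps_model, x0, cond, mu, sigma, alpha_bar):
    """One training step (paraphrase of the paper's Algorithm 1).

    x0:        clean waveform, shape (B, L)
    cond:      conditioning features (e.g., mel-spectrogram)
    mu, sigma: data-dependent prior mean and diagonal std, shape (B, L)
    alpha_bar: cumulative products of (1 - beta_t), shape (T,)
    """
    B = x0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (B,), device=x0.device)
    a_bar = alpha_bar[t].unsqueeze(-1)                     # (B, 1)
    eps = sigma * torch.randn_like(x0)                     # eps ~ N(0, Sigma)
    # Diffuse the mean-shifted signal (x0 - mu) toward the prior.
    x_t = torch.sqrt(a_bar) * (x0 - mu) + torch.sqrt(1.0 - a_bar) * eps
    pred = eps_model(x_t, cond, t)
    # Sigma^{-1}-weighted L2 loss on the predicted noise.
    return (((eps - pred) / sigma) ** 2).mean()

@torch.no_grad()
def sample(eps_model, cond, mu, sigma, betas, alphas, alpha_bar):
    """Reverse process (paraphrase of the paper's Algorithm 2)."""
    x = sigma * torch.randn_like(mu)                       # x_T ~ N(0, Sigma), shifted space
    for t in reversed(range(betas.shape[0])):
        t_vec = torch.full((mu.shape[0],), t, device=mu.device)
        pred = eps_model(x, cond, t_vec)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        x = (x - coef * pred) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * sigma * torch.randn_like(x)
    return x + mu                                          # shift back by the prior mean
```

The data-dependent prior enters in two places, the N(0, Σ) noise and the Σ^{-1}-weighted loss; with µ = 0 and Σ = I the sketch reduces to the standard DDPM objective and sampler.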
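For the Dataset Splits row: a trivial sketch of the quoted 13,000 / 5 / 95 partition of the 13,100 LJSpeech clips. The directory path and the clip-to-split ordering below are assumptions for illustration; the authors' exact assignment is not specified here.

```python
from pathlib import Path

# Illustrative 13,000 / 5 / 95 split of the 13,100 LJSpeech clips; the
# dataset path and the ordering are assumptions, not the authors' split.
filelist = sorted(Path("LJSpeech-1.1/wavs").glob("*.wav"))
assert len(filelist) == 13100, "expected the full LJSpeech corpus"
train_files = filelist[:13000]
valid_files = filelist[13000:13005]   # 5 validation clips
test_files = filelist[13005:]         # remaining 95 test clips
```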
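For the Experiment Setup row: a short sketch of the quoted diffusion and optimizer hyperparameters (T = 50 linear betas from 1 × 10⁻⁴ to 5 × 10⁻², Adam with learning rate 2 × 10⁻⁴). The six betas of the fast T_infer = 6 inference schedule are specified in the paper and its reference implementation and are left as a placeholder here; the tensors produced below are the ones consumed by the training/sampling sketch above.

```python
import torch

# Training-time noise schedule as quoted: T = 50 linear betas in [1e-4, 5e-2].
T_TRAIN = 50
betas = torch.linspace(1e-4, 5e-2, T_TRAIN)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

# Fast inference uses a hand-picked T_infer = 6 schedule; its concrete values
# come from the paper / reference implementation and are not reproduced here.
T_INFER = 6
betas_infer = None  # placeholder

# Optimizer as quoted: Adam with a learning rate of 2e-4 for 1M iterations.
# optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
```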