PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior
Authors: Sang-gil Lee, Heeseung Kim, Chaehun Shin, Xu Tan, Chang Liu, Qi Meng, Tao Qin, Wei Chen, Sungroh Yoon, Tie-Yan Liu
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implemented PriorGrad based on the recently proposed diffusion-based speech generative models (Kong et al., 2021; Chen et al., 2021; Jeong et al., 2021), and conducted experiments on the LJSpeech (Ito & Johnson, 2017) dataset. The experimental results demonstrate the benefits of PriorGrad, such as significantly faster model convergence during training, improved perceptual quality, and an improved tolerance to a reduction in network capacity. |
| Researcher Affiliation | Collaboration | Sang-gil Lee¹, Heeseung Kim¹, Chaehun Shin¹, Xu Tan², Chang Liu², Qi Meng², Tao Qin², Wei Chen², Sungroh Yoon¹,³, Tie-Yan Liu²; ¹Data Science & AI Lab., Seoul National University; ²Microsoft Research Asia; ³AIIS, ASRI, INMC, ISRC, NSI, and Interdisciplinary Program in Artificial Intelligence, Seoul National University |
| Pseudocode | Yes | Algorithms 1 and 2 describe the training and sampling procedures augmented by the data-dependent prior (µ, Σ). (A hedged training/sampling sketch is given after the table.) |
| Open Source Code | No | We followed the publicly available implementation3, where it uses a 2.62M parameter model... 3https://github.com/lmnt-com/diffwave - This link refers to a baseline implementation (DiffWave), not explicitly the open-source code for the authors' proposed method (PriorGrad). |
| Open Datasets | Yes | We used the LJSpeech (Ito & Johnson, 2017) dataset for all experiments, which is a commonly used open-source 24h speech dataset with 13,100 audio clips from a single female speaker. |
| Dataset Splits | Yes | We used 13,000 clips as the training set, 5 clips as the validation set, and the remaining 95 clips as the test set used for an objective and subjective audio quality evaluation. (An illustrative split snippet follows the table.) |
| Hardware Specification | Yes | Training for 1M iterations took approximately 7 days with a single NVIDIA A40 GPU. ... Training for 300K iterations took approximately 2 days on a single NVIDIA P100 GPU. |
| Software Dependencies | No | The paper mentions specific tools and libraries such as the Adam optimizer, Parallel WaveGAN, HiFi-GAN, the MFA toolkit, and SWIPE, and links to some open-source libraries, but it does not provide version numbers for these software dependencies (e.g., the PyTorch version or specific library versions such as auraloss version X.Y). |
| Experiment Setup | Yes | We used the publicly available implementation3, where it uses a 2.62M parameter model with an Adam optimizer (Kingma & Ba, 2014) and a learning rate of 2 × 10⁻⁴ for a total of 1M iterations. ... We used the default diffusion steps with T = 50 and the linear beta schedule ranging from 1 × 10⁻⁴ to 5 × 10⁻² for training and inference... We also used the fast T_infer = 6 inference noise schedule... We conducted a comparative study of the PriorGrad acoustic model with a different diffusion decoder network capacity, i.e., a small model with 3.5M parameters (128 residual channels) and a large model with 10M parameters (256 residual channels). (A configuration sketch corresponding to this setup follows the table.) |
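
The Pseudocode row refers to Algorithms 1 and 2 of the paper: training and sampling with a data-dependent Gaussian prior N(µ, diag(σ²)). Below is a minimal PyTorch sketch written from the description quoted in this table; the function names, the model signature `model(x_t, t, cond)`, and the use of the simple √β_t posterior noise scale are illustrative assumptions, not the authors' released code.

```python
import torch

# Diffusion schedule quoted in the Experiment Setup row: T = 50, linear betas in [1e-4, 5e-2].
T = 50
betas = torch.linspace(1e-4, 5e-2, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)


def training_step(model, x0, cond, mu, sigma):
    """One PriorGrad-style training step: diffuse x0 toward N(mu, diag(sigma^2)) and regress the noise."""
    b = x0.size(0)
    t = torch.randint(0, T, (b,), device=x0.device)
    a_bar = alpha_bars.to(x0.device)[t].view(b, *([1] * (x0.dim() - 1)))
    eps = sigma * torch.randn_like(x0)                    # noise drawn from N(0, diag(sigma^2))
    x_t = mu + torch.sqrt(a_bar) * (x0 - mu) + torch.sqrt(1.0 - a_bar) * eps
    eps_hat = model(x_t, t, cond)
    return ((eps - eps_hat) ** 2 / sigma ** 2).mean()     # L2 loss under the Sigma^{-1} norm


@torch.no_grad()
def sample(model, cond, mu, sigma):
    """Reverse process starting from the data-dependent prior x_T ~ N(mu, diag(sigma^2))."""
    x = mu + sigma * torch.randn_like(mu)
    for t in reversed(range(T)):
        beta, alpha, a_bar = betas[t], alphas[t], alpha_bars[t]
        t_batch = torch.full((x.size(0),), t, device=x.device)
        eps_hat = model(x, t_batch, cond)
        x = mu + ((x - mu) - beta / torch.sqrt(1.0 - a_bar) * eps_hat) / torch.sqrt(alpha)
        if t > 0:                                          # add sigma-scaled noise except at the final step
            x = x + torch.sqrt(beta) * sigma * torch.randn_like(x)
    return x
```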
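
For the Dataset Splits row, the 13,000/5/95 partition of LJSpeech's 13,100 clips can be reproduced along these lines. The paper does not state the exact clip ordering, so the sorted file listing and the directory layout of the standard LJSpeech-1.1 release are assumptions.

```python
from pathlib import Path

# Standard LJSpeech-1.1 layout is assumed; the exact ordering used by the authors is not specified.
wavs = sorted(Path("LJSpeech-1.1/wavs").glob("*.wav"))
assert len(wavs) == 13100

train_set = wavs[:13000]        # 13,000 training clips
val_set = wavs[13000:13005]     # 5 validation clips
test_set = wavs[13005:]         # remaining 95 clips for objective/subjective evaluation
```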
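
The Experiment Setup row corresponds roughly to the training-loop configuration sketched below, reusing `training_step` from the first sketch. `DummyEpsNet` and the random batch are placeholders standing in for the ~2.62M-parameter DiffWave-style network and an LJSpeech segment with its prior statistics (µ, σ); the fast T_infer = 6 inference schedule is mentioned in the row but its values are not quoted here, so it is omitted.

```python
import torch
import torch.nn as nn

class DummyEpsNet(nn.Module):
    """Placeholder for the ~2.62M-parameter dilated-convolution noise-prediction network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(1, 1, kernel_size=3, padding=1)

    def forward(self, x_t, t, cond):
        return self.net(x_t)

model = DummyEpsNet()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)   # Adam with learning rate 2 × 10⁻⁴

for step in range(3):                                       # the paper trains for 1M iterations; 3 shown here
    x0 = torch.randn(4, 1, 16000)                           # dummy waveform batch in place of LJSpeech audio
    cond = None                                             # conditioning (e.g., mel spectrogram) omitted
    mu, sigma = torch.zeros_like(x0), torch.ones_like(x0)   # dummy data-dependent prior statistics
    loss = training_step(model, x0, cond, mu, sigma)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```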