Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme
Authors: Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Sergeevich Kudinov, Jiansheng Wei
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a scalable high-quality solution based on diffusion probabilistic modeling and demonstrate its superior quality compared to state-of-the-art one-shot voice conversion approaches. Moreover, focusing on real-time applications, we investigate general principles which can make diffusion models faster while keeping synthesis quality at a high level. As a result, we develop a novel Stochastic Differential Equations solver suitable for various diffusion model types and generative tasks as shown through empirical studies and justify it by theoretical analysis. |
| Researcher Affiliation | Industry | Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Huawei Noah's Ark Lab, Moscow, Russia {vadim.popov,vovk.ivan,gogoryan.vladimir}@huawei.com Tasnima Sadekova, Mikhail Kudinov & Jiansheng Wei, Huawei Noah's Ark Lab, Moscow, Russia {sadekova.tasnima,kudinov.mikhail,weijiansheng}@huawei.com |
| Pseudocode | No | The paper includes mathematical equations and descriptions of the algorithms, but it does not feature any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The code will soon be published at https://github.com/huawei-noah/Speech-Backbones. |
| Open Datasets | Yes | We trained two groups of models: Diff-VCTK models on the VCTK (Yamagishi et al., 2019) dataset containing 109 speakers (9 speakers were held out for testing purposes) and Diff-LibriTTS models on LibriTTS (Zen et al., 2019) containing approximately 1100 speakers (10 speakers were held out). |
| Dataset Splits | Yes | We trained two groups of models: Diff-VCTK models on the VCTK (Yamagishi et al., 2019) dataset containing 109 speakers (9 speakers were held out for testing purposes) and Diff-LibriTTS models on LibriTTS (Zen et al., 2019) containing approximately 1100 speakers (10 speakers were held out). For every model both encoder and decoder were trained on the same dataset. ... In all AMT tests we considered unseen-to-unseen conversion with 25 unseen (for both Diff-VCTK and Diff-LibriTTS) speakers: 9 VCTK speakers, 10 LibriTTS speakers and 6 internal speakers. |
| Hardware Specification | No | To fit GPU memory, decoders were trained on random speech segments of approximately 1.5 seconds rather than on the whole utterances. The paper mentions utilizing GPU memory but does not provide any specific details about the GPU models (e.g., NVIDIA A100, RTX 2080 Ti), CPU models, or any other hardware specifications used for the experiments. |
| Software Dependencies | No | The paper does not explicitly list any specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) required to reproduce the experiments. It only mentions using the pre-trained universal HiFi-GAN vocoder, without specifying a version. |
| Experiment Setup | Yes | Training hyperparameters, implementation and data processing details can be found in Appendix I. ... Encoders and decoders were trained with batch sizes 128 and 32 and Adam optimizer with initial learning rates 0.0005 and 0.0001 correspondingly. Encoders and decoders in VCTK models were trained for 500 and 200 epochs respectively; as for LibriTTS models, they were trained for 300 and 110 epochs. ... Noise schedule parameters β0 and β1 were set to 0.05 and 20.0. (A hedged configuration sketch based on these reported values appears below the table.) |
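
The snippet below is a minimal, hedged sketch (not the authors' released code) showing how the values reported in the Experiment Setup row could be wired together in PyTorch: batch sizes 128/32, Adam optimizers with initial learning rates 0.0005/0.0001, and noise schedule endpoints β0 = 0.05, β1 = 20.0. The linear form of the noise schedule and the placeholder encoder/decoder modules are assumptions for illustration, not details confirmed by the paper.

```python
# Hedged sketch, not the authors' implementation: it only collects the
# hyperparameters quoted from Appendix I of the paper. Placeholder modules
# stand in for the paper's encoder and diffusion-based decoder.
import torch
import torch.nn as nn

# Values reported in the paper (Appendix I).
ENCODER_BATCH_SIZE, DECODER_BATCH_SIZE = 128, 32
ENCODER_LR, DECODER_LR = 5e-4, 1e-4
BETA_0, BETA_1 = 0.05, 20.0  # noise schedule endpoints

def beta_t(t: torch.Tensor) -> torch.Tensor:
    # Assumed linear schedule beta(t) = beta_0 + (beta_1 - beta_0) * t
    # for t in [0, 1]; the paper states only the endpoint values.
    return BETA_0 + (BETA_1 - BETA_0) * t

# Hypothetical placeholder networks; the real models are the paper's
# encoder and diffusion decoder.
encoder = nn.Linear(80, 80)
decoder = nn.Linear(80, 80)

encoder_opt = torch.optim.Adam(encoder.parameters(), lr=ENCODER_LR)
decoder_opt = torch.optim.Adam(decoder.parameters(), lr=DECODER_LR)
```

Since the paper also notes that decoders were trained on random speech segments of roughly 1.5 seconds to fit GPU memory, a reimplementation following these settings would additionally crop training utterances before batching.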