Neural Voice Cloning with a Few Samples
Authors: Sercan Ö. Arık, Jitong Chen, Kainan Peng, Wei Ping, Yanqi Zhou
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study two approaches: speaker adaptation and speaker encoding. ...both approaches can achieve good performance, even with a few cloning audios. ...We propose automated evaluation methods for voice cloning based on neural speaker classification and speaker verification. ...4 Experiments ...4.3 Voice cloning performance ...Tables 2 and 3 show the results of human evaluations. |
| Researcher Affiliation | Industry | Sercan Ö. Arık sercanarik@baidu.com Jitong Chen chenjitong01@baidu.com Kainan Peng pengkainan@baidu.com Wei Ping pingwei01@baidu.com Yanqi Zhou yanqiz@baidu.com Baidu Research 1195 Bordeaux Dr. Sunnyvale, CA 94089 |
| Pseudocode | No | The paper describes architectures and procedures but does not contain formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper links to cloned audio samples ('Cloned audio samples can be found in https://audiodemos.github.io') but makes no statement about releasing source code for the described methodology. |
| Open Datasets | Yes | In our first set of experiments (Sections 4.3 and 4.4), the multi-speaker generative model and speaker encoder are trained using the LibriSpeech dataset [Panayotov et al., 2015], which contains audio (16 kHz) for 2484 speakers, totalling 820 hours. ... Voice cloning is performed on the VCTK dataset [Veaux et al., 2017]. |
| Dataset Splits | Yes | The validation set consists of 25 held-out speakers. ... We split the VCTK dataset for training and testing: 84 speakers are used for training the multi-speaker model, 8 speakers for validation, and 16 speakers for cloning. |
| Hardware Specification | No | The paper does not specify the hardware (e.g., GPU/CPU models or memory) used to run its experiments. |
| Software Dependencies | No | The paper mentions the use of 'Griffin-Lim vocoder' and 'Deep Voice 3' as baseline models but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | To get better performance, we increase the time-resolution by reducing the hop length and window size parameters to 300 and 1200, and add a quadratic loss term to penalize large amplitude components superlinearly. For speaker adaptation experiments, we reduce the embedding dimensionality to 128. ... Initially, cloning audios are converted to log-mel spectrograms with 80 frequency bands, with a hop length of 400, a window size of 1600. Log-mel spectrograms are fed to spectral processing layers, which are composed of 2-layer prenet of size 128. Then, temporal processing is applied with two 1-D convolutional layers with a filter width of 12. Finally, multi-head attention is applied with 2 heads and a unit size of 128 for keys, queries and values. The final embedding size is 512. ... A batch size of 64 is used, with an initial learning rate of 0.0006 with annealing rate of 0.6 applied every 8000 iterations. (See the sketches following this table.) |
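The Experiment Setup row quotes concrete feature-extraction parameters. As a minimal sketch, assuming 16 kHz input and librosa's standard mel pipeline (the paper does not name its feature-extraction library), the quoted 80-band log-mel spectrograms with hop length 400 and window size 1600 could be computed as:

```python
# Sketch of the quoted feature extraction: 80-band log-mel spectrograms
# with hop length 400 and window size 1600 at 16 kHz. Using librosa is an
# assumption; the paper does not say which library it used.
import librosa

def log_mel(path, sr=16000, n_mels=80, hop_length=400, win_length=1600):
    audio, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=win_length, win_length=win_length,
        hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel)  # shape: (80, frames)
```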
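The speaker-encoder description in the same row maps onto a few standard layers. The PyTorch sketch below wires the quoted sizes together (80 mel bands, 2-layer prenet of size 128, two 1-D convolutions with filter width 12, 2 attention heads with unit size 128, embedding size 512); the layer ordering glue, activation choice, and mean-pooling over time are assumptions for illustration, not the authors' exact model, which attends over multiple cloning samples:

```python
# Sketch of the quoted speaker-encoder configuration. Only the numeric
# hyperparameters come from the paper; activations (ELU), padding, and the
# final mean-pooling are illustrative assumptions.
import torch
import torch.nn as nn

class SpeakerEncoderSketch(nn.Module):
    def __init__(self, n_mels=80, prenet_size=128, conv_width=12,
                 attn_heads=2, attn_size=128, embedding_size=512):
        super().__init__()
        # Spectral processing: 2-layer prenet of size 128.
        self.prenet = nn.Sequential(
            nn.Linear(n_mels, prenet_size), nn.ELU(),
            nn.Linear(prenet_size, prenet_size), nn.ELU(),
        )
        # Temporal processing: two 1-D convolutions with filter width 12.
        self.conv = nn.Sequential(
            nn.Conv1d(prenet_size, prenet_size, conv_width, padding="same"),
            nn.ELU(),
            nn.Conv1d(prenet_size, prenet_size, conv_width, padding="same"),
            nn.ELU(),
        )
        # Multi-head attention: 2 heads, unit size 128 for keys/queries/values.
        self.attn = nn.MultiheadAttention(attn_size, attn_heads,
                                          batch_first=True)
        self.out = nn.Linear(attn_size, embedding_size)

    def forward(self, mel):                  # mel: (batch, time, 80)
        x = self.prenet(mel)                 # (batch, time, 128)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.attn(x, x, x)            # (batch, time, 128)
        return self.out(x.mean(dim=1))       # (batch, 512) speaker embedding
```

Note that for the speaker adaptation experiments the paper reduces the embedding dimensionality to 128, so `embedding_size` would change accordingly there.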
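Finally, the quoted optimization settings amount to a simple step-decay schedule. A sketch, reading "annealing rate of 0.6 applied every 8000 iterations" as a multiplicative decay (the paper does not spell out the decay form):

```python
# Step decay: start at 0.0006 and multiply by 0.6 every 8000 iterations.
# The multiplicative reading of "annealing rate" is an assumption.
def learning_rate(step, base_lr=6e-4, gamma=0.6, interval=8000):
    return base_lr * gamma ** (step // interval)

print(learning_rate(0))      # 0.0006
print(learning_rate(8000))   # 0.00036
print(learning_rate(16000))  # 0.000216
```

Under that reading, the same schedule in PyTorch is `torch.optim.lr_scheduler.StepLR(optimizer, step_size=8000, gamma=0.6)`, stepped per iteration rather than per epoch.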