Neural Voice Cloning with a Few Samples

Authors: Sercan Ö. Arık, Jitong Chen, Kainan Peng, Wei Ping, Yanqi Zhou

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study two approaches: speaker adaptation and speaker encoding. ... both approaches can achieve good performance, even with a few cloning audios. ... We propose automated evaluation methods for voice cloning based on neural speaker classification and speaker verification. ... 4 Experiments ... 4.3 Voice cloning performance ... Tables 2 and 3 show the results of human evaluations.
Researcher Affiliation | Industry | Sercan Ö. Arık (sercanarik@baidu.com), Jitong Chen (chenjitong01@baidu.com), Kainan Peng (pengkainan@baidu.com), Wei Ping (pingwei01@baidu.com), Yanqi Zhou (yanqiz@baidu.com); Baidu Research, 1195 Bordeaux Dr., Sunnyvale, CA 94089
Pseudocode | No | The paper describes architectures and procedures but does not contain formal pseudocode or algorithm blocks.
Open Source Code | No | The paper links to cloned audio samples ('Cloned audio samples can be found in https://audiodemos.github.io') but provides no link to, or statement about, source code for the described methodology.
Open Datasets | Yes | In our first set of experiments (Sections 4.3 and 4.4), the multi-speaker generative model and speaker encoder are trained using the LibriSpeech dataset [Panayotov et al., 2015], which contains audios (16 kHz) for 2484 speakers, totalling 820 hours. ... Voice cloning is performed on the VCTK dataset [Veaux et al., 2017].
Dataset Splits | Yes | The validation set consists of 25 held-out speakers. ... We split the VCTK dataset for training and testing: 84 speakers are used for training the multi-speaker model, 8 speakers for validation, and 16 speakers for cloning.
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU/CPU models or memory) used to run its experiments.
Software Dependencies | No | The paper names the Deep Voice 3 model and the Griffin-Lim vocoder it builds on, but provides no version numbers for any software dependency.
Experiment Setup | Yes | To get better performance, we increase the time-resolution by reducing the hop length and window size parameters to 300 and 1200, and add a quadratic loss term to penalize large amplitude components superlinearly. For speaker adaptation experiments, we reduce the embedding dimensionality to 128. ... Initially, cloning audios are converted to log-mel spectrograms with 80 frequency bands, a hop length of 400, and a window size of 1600. Log-mel spectrograms are fed to spectral processing layers, which are composed of a 2-layer prenet of size 128. Then, temporal processing is applied with two 1-D convolutional layers with a filter width of 12. Finally, multi-head attention is applied with 2 heads and a unit size of 128 for keys, queries, and values. The final embedding size is 512. ... A batch size of 64 is used, with an initial learning rate of 0.0006 and an annealing rate of 0.6 applied every 8000 iterations. (Hedged code sketches of this setup follow the table.)
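
The quoted front end is concrete enough to sketch in code. Below is a minimal log-mel extraction for cloning audio with the stated parameters (80 mel bands, hop length 400, window size 1600, 16 kHz); the choice of librosa, the FFT size (set equal to the window size), and the log floor are assumptions, not details from the paper.

```python
import librosa
import numpy as np

def log_mel(path, sr=16000, n_mels=80, hop_length=400, win_length=1600):
    """Convert a cloning audio file to a log-mel spectrogram.

    Parameters follow the paper's quoted front end; the FFT size
    (set to the window length) and the librosa backend are assumptions.
    """
    audio, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio,
        sr=sr,
        n_fft=win_length,       # assumed: FFT size == window size
        hop_length=hop_length,  # 400 samples = 25 ms at 16 kHz
        win_length=win_length,  # 1600 samples = 100 ms at 16 kHz
        n_mels=n_mels,
    )
    # Log compression; the paper does not specify a floor, 1e-5 is a guess.
    return np.log(np.maximum(mel, 1e-5)).T  # shape: (frames, 80)
```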
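The speaker-encoder description (2-layer prenet of size 128, two 1-D convolutions with filter width 12, 2-head attention with unit size 128, final embedding size 512) can be read as the rough PyTorch module below. This is a sketch of that reading, not the authors' implementation: the activations, convolution padding, temporal pooling, and the 128-to-512 projection are all assumptions.

```python
import torch
import torch.nn as nn

class SpeakerEncoderSketch(nn.Module):
    """Hedged sketch of the paper's speaker encoder.

    Layer sizes follow the quoted setup; activations, padding, temporal
    pooling, and the 128 -> 512 projection are assumptions.
    """

    def __init__(self, n_mels=80, hidden=128, embed_dim=512):
        super().__init__()
        # Spectral processing: 2-layer prenet of size 128.
        self.prenet = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Temporal processing: two 1-D convolutions with filter width 12.
        self.conv = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size=12, padding="same"), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=12, padding="same"), nn.ReLU(),
        )
        # Multi-head attention: 2 heads, unit size 128 for keys/queries/values.
        self.attn = nn.MultiheadAttention(hidden, num_heads=2, batch_first=True)
        # Final 512-dim speaker embedding (this projection is an assumption).
        self.proj = nn.Linear(hidden, embed_dim)

    def forward(self, mels):  # mels: (batch, frames, 80)
        x = self.prenet(mels)                     # (batch, frames, 128)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.attn(x, x, x)                 # self-attention over frames
        x = x.mean(dim=1)                         # average-pool over time (assumed)
        e = self.proj(x)                          # (batch, 512)
        return e / e.norm(dim=-1, keepdim=True)   # unit-norm embedding (assumed)
```

Mean-pooling the attended frames before the projection is one plausible way to collapse a variable-length spectrogram into a fixed 512-dim embedding; the paper's attention over cloning samples may combine them differently.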
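Finally, the quoted optimization settings (batch size 64, initial learning rate 0.0006, annealing rate 0.6 every 8000 iterations) map onto a plain step-decay schedule. The sketch below assumes Adam and per-iteration scheduler stepping, neither of which is stated in the quoted text.

```python
import torch
import torch.nn as nn

# Placeholder model; in the paper this would be the multi-speaker
# generative model and/or the speaker encoder.
model = nn.Linear(80, 80)

# Adam is an assumption; only the rate and schedule are quoted.
optimizer = torch.optim.Adam(model.parameters(), lr=6e-4)
# Anneal: multiply the learning rate by 0.6 every 8000 iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8000, gamma=0.6)

for iteration in range(1, 24001):
    optimizer.step()   # gradient computation omitted in this sketch
    scheduler.step()
    if iteration % 8000 == 0:
        # Prints 3.6e-4, then 2.16e-4, then 1.296e-4.
        print(iteration, scheduler.get_last_lr())
```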