Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion

Authors: Joan Serrà, Santiago Pascual, Carlos Segura Perales

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We quantify the effectiveness of Blow both objectively and subjectively, obtaining comparable or even better performance than a number of baselines. We also perform an ablation study to quantify the relative importance of every new component, and assess further aspects such as the preference for source/target speakers or the relation between objective scores and the amount of training audio."
Researcher Affiliation | Collaboration | Joan Serrà (Telefónica Research, joan.serra@telefonica.com); Santiago Pascual (Universitat Politècnica de Catalunya, santi.pascual@upc.edu); Carlos Segura (Telefónica Research, carlos.seguraperales@telefonica.com)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | "We use public data and make our code available at https://github.com/joansj/blow."
Open Datasets | Yes | "To study the performance of Blow we use the VCTK data set [51], which comprises 46 h of audio from 109 speakers." Reference [51]: C. Veaux, J. Yamagishi, and K. MacDonald. CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit, 2012. URL: http://dx.doi.org/10.7488/ds/1994.
Dataset Splits | Yes | "We downsample it at 16 kHz and randomly extract 10% of the sentences for validation and 10% for testing (we use a simple parsing script to ensure that the same sentence text does not get into different splits, see Appendix B)." (A sketch of such a sentence-disjoint split follows the table.)
Hardware Specification | Yes | "With this amount of data, the training of Blow takes 13 days using three GeForce RTX 2080-Ti GPUs."
Software Dependencies | No | The paper mentions 'PyTorch [48]' but does not specify a version number for it or other software dependencies.
Experiment Setup | Yes | "We train Blow with Adam using a learning rate of 10^-4 and a batch size of 114. We anneal the learning rate by a factor of 5 if 10 epochs have passed without improvement in the validation set, and stop training at the third time this happens. We use an 8×12 structure, with 2 alternate-pattern squeezing operations. For the coupling network, we split channels into two halves, and use one-dimensional convolutions with 512 filters and kernel widths 3, 1, and 3. Embeddings are of dimension 128. We train with a frame size of 4096 at 16 kHz with no overlap, and initialize the ActNorm weights with one data-augmented batch (batches contain a random mixture of frames from all speakers)." (A coupling-network sketch also follows the table.)
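
To illustrate the sentence-disjoint split quoted under Dataset Splits, here is a minimal Python sketch. The VCTK-style file naming (e.g., p225_003.wav, with the sentence ID after the underscore) and the grouping rule are assumptions on our part; the paper's actual parsing script is described in its Appendix B.

```python
import random
from collections import defaultdict
from pathlib import Path


def split_by_sentence(wav_paths, val_frac=0.1, test_frac=0.1, seed=0):
    """Split audio files so the same sentence text never lands in two
    splits: group files by sentence ID, then sample whole groups into
    validation and test (10% each, as in the paper)."""
    groups = defaultdict(list)
    for path in wav_paths:
        # VCTK-style names like p225_003.wav: the sentence ID follows
        # the underscore (an assumed naming rule, not the paper's script).
        sentence_id = Path(path).stem.split("_")[1]
        groups[sentence_id].append(path)

    ids = sorted(groups)
    random.Random(seed).shuffle(ids)
    n_val, n_test = int(len(ids) * val_frac), int(len(ids) * test_frac)
    val_ids = set(ids[:n_val])
    test_ids = set(ids[n_val:n_val + n_test])

    split = {"train": [], "valid": [], "test": []}
    for sid, paths in groups.items():
        key = "valid" if sid in val_ids else "test" if sid in test_ids else "train"
        split[key].extend(paths)
    return split
```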
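On the quoted training schedule: annealing by a factor of 5 after 10 stagnant validation epochs maps onto PyTorch's ReduceLROnPlateau with factor=0.2 and patience=10, with training stopped after the third anneal. For the coupling network itself, the sketch below shows one way to realize a Glow-style affine coupling whose first convolution is hyperconditioned, i.e., its kernels are generated from the speaker embedding. Only the quoted sizes (512 filters, kernel widths 3, 1, and 3; 128-dimensional embeddings) come from the paper; the exact wiring is an assumption, not the authors' reference implementation (which is available at the repository linked above).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HyperConditionedCoupling(nn.Module):
    """Glow-style affine coupling over 1-D audio frames.

    Channels are split into two halves; one half parameterizes an affine
    transform of the other. The kernels of the first convolution are
    generated from a speaker embedding ("hyperconditioning"). Sizes follow
    the quoted setup; the wiring itself is an assumption.
    """

    def __init__(self, channels: int, emb_dim: int = 128, hidden: int = 512):
        super().__init__()
        self.half = channels // 2
        self.hidden = hidden
        # Hypernetwork: embedding -> per-example kernels and biases of conv1
        # (kernel width 3), followed by fixed convs of widths 1 and 3.
        self.hyper = nn.Linear(emb_dim, hidden * self.half * 3 + hidden)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=1)
        self.conv3 = nn.Conv1d(hidden, 2 * self.half, kernel_size=3, padding=1)

    def forward(self, x, emb):
        # x: (B, channels, T) audio frame; emb: (B, emb_dim) speaker embedding.
        x_a, x_b = x.chunk(2, dim=1)
        params = self.hyper(emb)
        w, b = params[:, :-self.hidden], params[:, -self.hidden:]
        # Grouped-convolution trick: fold the batch into groups so every
        # example is filtered with its own generated kernels.
        B, T = x.size(0), x.size(-1)
        w = w.reshape(B * self.hidden, self.half, 3)
        h = F.conv1d(x_a.reshape(1, B * self.half, T), w,
                     bias=b.reshape(-1), padding=1, groups=B)
        h = F.relu(h.reshape(B, self.hidden, T))
        h = F.relu(self.conv2(h))
        log_s, t = self.conv3(h).chunk(2, dim=1)
        y_b = x_b * torch.exp(log_s) + t              # affine transform
        logdet = log_s.sum(dim=(1, 2))                # log-det of the Jacobian
        return torch.cat([x_a, y_b], dim=1), logdet
```

Because the transform of x_b is affine given x_a and the embedding, the block is exactly invertible (x_b = (y_b - t) * exp(-log_s)), which is what lets conversion run analysis and synthesis through the same network.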