Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion
Authors: Joan Serrà, Santiago Pascual, Carlos Segura Perales
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We quantify the effectiveness of Blow both objectively and subjectively, obtaining comparable or even better performance than a number of baselines. We also perform an ablation study to quantify the relative importance of every new component, and assess further aspects such as the preference for source/target speakers or the relation between objective scores and the amount of training audio. |
| Researcher Affiliation | Collaboration | Joan Serrà, Telefónica Research, joan.serra@telefonica.com; Santiago Pascual, Universitat Politècnica de Catalunya, santi.pascual@upc.edu; Carlos Segura, Telefónica Research, carlos.seguraperales@telefonica.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We use public data and make our code available at https://github.com/joansj/blow. |
| Open Datasets | Yes | To study the performance of Blow we use the VCTK data set [51], which comprises 46 h of audio from 109 speakers. [51] C. Veaux, J. Yamagishi, and K. MacDonald. CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit, 2012. URL http://dx.doi.org/10.7488/ds/1994. |
| Dataset Splits | Yes | We downsample it at 16 kHz and randomly extract 10% of the sentences for validation and 10% for testing (we use a simple parsing script to ensure that the same sentence text does not get into different splits, see Appendix B). A code sketch of this sentence-aware split appears after the table. |
| Hardware Specification | Yes | With this amount of data, the training of Blow takes 13 days using three GeForce RTX 2080 Ti GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch [48]' but does not specify a version number for it or other software dependencies. |
| Experiment Setup | Yes | We train Blow with Adam using a learning rate of 10⁻⁴ and a batch size of 114. We anneal the learning rate by a factor of 5 if 10 epochs have passed without improvement in the validation set, and stop training at the third time this happens. We use an 8×12 structure, with 2 alternate-pattern squeezing operations. For the coupling network, we split channels into two halves, and use one-dimensional convolutions with 512 filters and kernel widths 3, 1, and 3. Embeddings are of dimension 128. We train with a frame size of 4096 at 16 kHz with no overlap, and initialize the ActNorm weights with one data-augmented batch (batches contain a random mixture of frames from all speakers). |
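
The training recipe quoted in the Experiment Setup row maps directly onto standard PyTorch components. Below is a minimal sketch, not the authors' training loop (that is available at https://github.com/joansj/blow): `train_epoch` and `validate` are hypothetical helpers, and `ReduceLROnPlateau` with `factor=0.2` stands in for the paper's divide-by-5 annealing.

```python
import torch

def train_blow(model, train_epoch, validate,
               lr=1e-4, patience=10, max_annealings=3):
    """Train until the learning rate has been annealed three times.

    train_epoch(model, optimizer) and validate(model) -> float are
    assumed (hypothetical) helpers; only the schedule logic is shown.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # Divide the LR by 5 (factor=0.2) once `patience` epochs pass
    # without improvement in validation loss, per the quoted rule.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.2, patience=patience)
    annealings = 0
    while annealings < max_annealings:
        train_epoch(model, optimizer)
        val_loss = validate(model)
        lr_before = optimizer.param_groups[0]['lr']
        scheduler.step(val_loss)
        if optimizer.param_groups[0]['lr'] < lr_before:
            annealings += 1  # "stop training at the third time this happens"
```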
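
The sentence-aware split from the Dataset Splits row can be approximated in a few lines: group utterances by their sentence text, then assign whole groups to partitions, so the same sentence never appears in two splits. This is a sketch under assumptions, not the parsing script from Appendix B; `text_of` is a hypothetical lookup from utterance ID to sentence text (VCTK ships transcripts alongside the audio).

```python
import random
from collections import defaultdict

def split_by_sentence(utt_ids, text_of, val_frac=0.1, test_frac=0.1, seed=0):
    """Split utterance IDs into train/valid/test so that all recordings
    of the same sentence text land in the same partition."""
    groups = defaultdict(list)
    for utt in utt_ids:
        groups[text_of(utt)].append(utt)  # group by sentence text
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_val = round(val_frac * len(keys))
    n_test = round(test_frac * len(keys))
    valid = [u for k in keys[:n_val] for u in groups[k]]
    test = [u for k in keys[n_val:n_val + n_test] for u in groups[k]]
    train = [u for k in keys[n_val + n_test:] for u in groups[k]]
    return train, valid, test
```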