AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
Authors: Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Mark Hasegawa-Johnson
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we will evaluate AUTOVC on many-to-many voice conversion tasks, and empirically validate the assumptions of the AUTOVC framework. We performed two subjective tests on Amazon Mechanical Turk (MTurk). |
| Researcher Affiliation | Collaboration | (1) University of Illinois at Urbana-Champaign, IL, USA; (2) MIT-IBM Watson AI Lab, Cambridge, MA, USA; (3) IBM Research, Cambridge, MA, USA. |
| Pseudocode | No | The paper describes the architecture and process in text and diagrams but does not include any explicit pseudocode blocks or algorithm listings. |
| Open Source Code | No | The paper states only that "the implementation will become publicly available"; no code repository is linked. |
| Open Datasets | Yes | The evaluation is performed on the VCTK corpus (Veaux et al., 2016), which contains 44 hours of utterances from 109 speakers. In our implementation, the speaker encoder is pre-trained on the combination of VoxCeleb1 (Nagrani et al., 2017) and LibriSpeech (Panayotov et al., 2015) corpora, where there are a total of 3549 speakers. |
| Dataset Splits | No | The data of each speaker is partitioned into training and test sets with a 9:1 ratio. The paper mentions training and test sets but does not describe a validation split (a sketch of such a per-speaker partition follows the table). |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run its experiments. |
| Software Dependencies | No | The paper mentions various software components and algorithms (e.g., ADAM optimizer, WaveNet vocoder, LSTM, PyTorch), but it does not specify their version numbers for reproducibility. |
| Experiment Setup | Yes | AUTOVC is trained with a batch size of two for 100k steps using the ADAM optimizer. The speaker embedding is generated by feeding 10 two-second utterances of the same speaker to the speaker encoder and averaging the resulting embeddings. The weights in Eq. (12) are set to λ = 1, µ = 1 (see the training-loop sketch after the table). |
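The Dataset Splits row reports only that each speaker's data is partitioned into training and test sets with a 9:1 ratio. The following is a minimal sketch of such a per-speaker partition, assuming utterances are held in a dict keyed by speaker ID; the shuffling and the random seed are illustrative assumptions, not details from the paper.

```python
import random

def split_per_speaker(utterances_by_speaker, train_ratio=0.9, seed=0):
    """Partition each speaker's utterances into train/test sets at a 9:1 ratio.

    `utterances_by_speaker` maps a speaker ID to a list of utterance paths;
    this container format is an assumption, since the paper states only the ratio.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for speaker, utts in utterances_by_speaker.items():
        utts = list(utts)
        rng.shuffle(utts)                         # assumed: random per-speaker split
        cut = int(len(utts) * train_ratio)
        train[speaker] = utts[:cut]               # 90% of this speaker's utterances
        test[speaker] = utts[cut:]                # remaining 10%
    return train, test
```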
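The Experiment Setup row lists the reported hyperparameters: ADAM optimizer, batch size two, 100k steps, loss weights λ = µ = 1, and speaker embeddings averaged over 10 two-second utterances. Below is a minimal PyTorch-style training-loop sketch using those values. The model interface (`model(mel, emb, emb)` returning the initial decoder output, the postnet output, and the content codes; a `content_codes` helper), the learning rate, and the exact form of the loss terms are assumptions loosely following the paper's Eq. (12), since no official implementation was released with the paper.

```python
import torch
import torch.nn.functional as F

def average_speaker_embedding(speaker_encoder, utterances):
    """Average the embeddings of ~10 two-second utterances of one speaker.

    The speaker encoder is assumed to be a pre-trained, frozen module that
    returns one embedding vector per utterance.
    """
    with torch.no_grad():
        embs = [speaker_encoder(u) for u in utterances]
    return torch.stack(embs).mean(dim=0)

def train_autovc(model, speaker_encoder, mel_loader,
                 num_steps=100_000, lam=1.0, mu=1.0, lr=1e-4):
    """Training-loop sketch matching the reported setup: ADAM optimizer,
    batch size 2 (configured in `mel_loader`), 100k steps, λ = µ = 1.
    The learning rate and the model's calling convention are assumptions.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    step = 0
    while step < num_steps:
        for mel, spk_utts in mel_loader:
            if step >= num_steps:
                break
            emb = average_speaker_embedding(speaker_encoder, spk_utts)
            # Self-reconstruction: convert a speaker's speech back to itself.
            mel_initial, mel_postnet, codes = model(mel, emb, emb)
            # Loss in the spirit of Eq. (12): postnet reconstruction,
            # initial reconstruction (weight mu), and content-code
            # consistency (weight lam); exact terms are assumptions.
            loss = (F.mse_loss(mel_postnet, mel)
                    + mu * F.mse_loss(mel_initial, mel)
                    + lam * F.l1_loss(codes,
                                      model.content_codes(mel_postnet, emb)))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
    return model
```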