AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

Authors: Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Mark Hasegawa-Johnson

ICML 2019

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | In this section, we will evaluate AUTOVC on many-to-many voice conversion tasks, and empirically validate the assumptions of the AUTOVC framework. We performed two subjective tests on Amazon Mechanical Turk (MTurk). |
| Researcher Affiliation | Collaboration | 1. University of Illinois at Urbana-Champaign, IL, USA; 2. MIT-IBM Watson AI Lab, Cambridge, MA, USA; 3. IBM Research, Cambridge, MA, USA. |
| Pseudocode | No | The paper describes the architecture and process in text and diagrams but does not include any explicit pseudocode blocks or algorithm listings. |
| Open Source Code | No | The implementation will become publicly available. |
| Open Datasets | Yes | The evaluation is performed on the VCTK corpus (Veaux et al., 2016), which contains 44 hours of utterances from 109 speakers. In our implementation, the speaker encoder is pre-trained on the combination of VoxCeleb1 (Nagrani et al., 2017) and LibriSpeech (Panayotov et al., 2015) corpora, where there are a total of 3549 speakers. |
| Dataset Splits | No | The data of each speaker is then partitioned into training and test sets by 9:1. The paper mentions training and test sets but does not explicitly provide details for a validation split. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run its experiments. |
| Software Dependencies | No | The paper mentions various software components and algorithms (e.g., the ADAM optimizer, WaveNet vocoder, LSTM, PyTorch) but does not specify version numbers for reproducibility. |
| Experiment Setup | Yes | AUTOVC is trained with a batch size of two for 100k steps, using the ADAM optimizer. The speaker embedding is generated by feeding 10 two-second utterances of the same speaker to the speaker encoder and averaging the resulting embeddings. The weights in Eq. (12) are set to λ = 1, µ = 1. |
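
The table above mentions a few concrete procedures; the sketches below illustrate them under stated assumptions. First, the 9:1 per-speaker partition noted under Dataset Splits: a minimal sketch, assuming utterances are already grouped by speaker ID. All names (`utterances_by_speaker`, `split_per_speaker`) are hypothetical and not taken from the AutoVC codebase.

```python
# Hypothetical sketch of a 9:1 per-speaker train/test partition.
# `utterances_by_speaker` maps each speaker ID to a list of utterance paths.
import random

def split_per_speaker(utterances_by_speaker, train_ratio=0.9, seed=0):
    """Partition each speaker's utterances into train/test sets by train_ratio."""
    rng = random.Random(seed)
    train, test = {}, {}
    for speaker, utts in utterances_by_speaker.items():
        utts = list(utts)
        rng.shuffle(utts)
        cut = int(len(utts) * train_ratio)
        train[speaker] = utts[:cut]
        test[speaker] = utts[cut:]
    return train, test
```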
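
Second, the speaker-embedding step from the Experiment Setup row (ten two-second utterances fed to the speaker encoder, embeddings averaged): a sketch in PyTorch, assuming the encoder maps a batch of mel-spectrogram crops to one embedding per crop. The L2 normalization and all shapes are assumptions, not the paper's exact interface.

```python
import torch

def average_speaker_embedding(speaker_encoder, crops):
    """crops: (10, n_frames, n_mels) tensor holding ten 2-second mel crops."""
    with torch.no_grad():
        embs = speaker_encoder(crops)                       # (10, emb_dim)
        embs = torch.nn.functional.normalize(embs, dim=-1)  # assumed unit-norm outputs
        return embs.mean(dim=0)                             # (emb_dim,)

# Toy usage with a stand-in encoder; real shapes depend on the mel front end.
toy_encoder = torch.nn.Sequential(torch.nn.Flatten(1), torch.nn.Linear(160 * 80, 256))
crops = torch.randn(10, 160, 80)
print(average_speaker_embedding(toy_encoder, crops).shape)  # torch.Size([256])
```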
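
Finally, the reported training configuration (ADAM, batch size two, 100k steps, λ = µ = 1 in Eq. (12)): a minimal loss sketch. Eq. (12) is not reproduced in this section; the composite form below (post-net reconstruction + pre-net reconstruction + content-code terms, weighted by µ and λ) follows common AutoVC implementations and should be treated as an assumption.

```python
import torch
import torch.nn.functional as F

LAMBDA, MU = 1.0, 1.0  # the paper's reported weights in Eq. (12)

def autovc_loss(mel, mel_hat, mel_hat_post, code, code_hat):
    """Weighted autoencoder loss; term layout is an assumption (see lead-in)."""
    l_recon = F.mse_loss(mel_hat_post, mel)   # reconstruction after the postnet
    l_recon0 = F.mse_loss(mel_hat, mel)       # reconstruction before the postnet
    l_content = F.l1_loss(code_hat, code)     # content-code consistency
    return l_recon + MU * l_recon0 + LAMBDA * l_content

# Shape-only demo with random tensors and the reported batch size of two.
mel = torch.randn(2, 128, 80)
mel_hat = torch.randn_like(mel, requires_grad=True)       # stand-in decoder output
mel_hat_post = torch.randn_like(mel, requires_grad=True)  # stand-in postnet output
code = torch.randn(2, 64, 32)
code_hat = torch.randn(2, 64, 32, requires_grad=True)
loss = autovc_loss(mel, mel_hat, mel_hat_post, code, code_hat)
loss.backward()  # in training, this would drive one of the 100k ADAM steps
```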