AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
Authors: Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Mark Hasegawa-Johnson
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we will evaluate AUTOVC on many-to-many voice conversion tasks, and empirically validate the assumptions of the AUTOVC framework. We performed two subjective tests on Amazon Mechanical Turk (MTurk). |
| Researcher Affiliation | Collaboration | (1) University of Illinois at Urbana-Champaign, IL, USA; (2) MIT-IBM Watson AI Lab, Cambridge, MA, USA; (3) IBM Research, Cambridge, MA, USA. |
| Pseudocode | No | The paper describes the architecture and process in text and diagrams but does not include any explicit pseudocode blocks or algorithm listings. |
| Open Source Code | No | The paper states only that "the implementation will become publicly available"; no code repository is linked. |
| Open Datasets | Yes | The evaluation is performed on the VCTK corpus (Veaux et al., 2016), which contains 44 hours of utterances from 109 speakers. In our implementation, the speaker encoder is pre-trained on the combination of VoxCeleb1 (Nagrani et al., 2017) and LibriSpeech (Panayotov et al., 2015) corpora, where there are a total of 3549 speakers. |
| Dataset Splits | No | The data of each speaker is partitioned into training and test sets with a 9:1 ratio. The paper mentions training and test sets but does not describe a validation split (a sketch of such a per-speaker partition follows the table). |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run its experiments. |
| Software Dependencies | No | The paper mentions various software components and algorithms (e.g., ADAM optimizer, WaveNet vocoder, LSTM, PyTorch), but it does not specify their version numbers for reproducibility. |
| Experiment Setup | Yes | AUTOVC is trained with a batch size of two for 100k steps using the ADAM optimizer. The speaker embedding is generated by feeding 10 two-second utterances of the same speaker to the speaker encoder and averaging the resulting embeddings. The weights in Eq. (12) are set to λ = 1, µ = 1 (see the training-loop sketch after the table). |
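The Dataset Splits row reports only that each speaker's data is partitioned into training and test sets with a 9:1 ratio. The following is a minimal sketch of such a per-speaker partition, assuming utterances are held in a dict keyed by speaker ID; the shuffling and the random seed are illustrative assumptions, not details from the paper.

```python
import random

def split_per_speaker(utterances_by_speaker, train_ratio=0.9, seed=0):
    """Partition each speaker's utterances into train/test sets at a 9:1 ratio.

    `utterances_by_speaker` maps a speaker ID to a list of utterance paths;
    this container format is an assumption, since the paper states only the ratio.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for speaker, utts in utterances_by_speaker.items():
        utts = list(utts)
        rng.shuffle(utts)                         # assumed: random per-speaker split
        cut = int(len(utts) * train_ratio)
        train[speaker] = utts[:cut]               # 90% of this speaker's utterances
        test[speaker] = utts[cut:]                # remaining 10%
    return train, test
```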
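The Experiment Setup row lists the reported hyperparameters: ADAM optimizer, batch size two, 100k steps, loss weights λ = µ = 1, and speaker embeddings averaged over 10 two-second utterances. Below is a minimal PyTorch-style training-loop sketch using those values. The model interface (`model(mel, emb, emb)` returning the initial decoder output, the postnet output, and the content codes; a `content_codes` helper), the learning rate, and the exact form of the loss terms are assumptions loosely following the paper's Eq. (12), since no official implementation was released with the paper.

```python
import torch
import torch.nn.functional as F

def average_speaker_embedding(speaker_encoder, utterances):
    """Average the embeddings of ~10 two-second utterances of one speaker.

    The speaker encoder is assumed to be a pre-trained, frozen module that
    returns one embedding vector per utterance.
    """
    with torch.no_grad():
        embs = [speaker_encoder(u) for u in utterances]
    return torch.stack(embs).mean(dim=0)

def train_autovc(model, speaker_encoder, mel_loader,
                 num_steps=100_000, lam=1.0, mu=1.0, lr=1e-4):
    """Training-loop sketch matching the reported setup: ADAM optimizer,
    batch size 2 (configured in `mel_loader`), 100k steps, λ = µ = 1.
    The learning rate and the model's calling convention are assumptions.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    step = 0
    while step < num_steps:
        for mel, spk_utts in mel_loader:
            if step >= num_steps:
                break
            emb = average_speaker_embedding(speaker_encoder, spk_utts)
            # Self-reconstruction: convert a speaker's speech back to itself.
            mel_initial, mel_postnet, codes = model(mel, emb, emb)
            # Loss in the spirit of Eq. (12): postnet reconstruction,
            # initial reconstruction (weight mu), and content-code
            # consistency (weight lam); exact terms are assumptions.
            loss = (F.mse_loss(mel_postnet, mel)
                    + mu * F.mse_loss(mel_initial, mel)
                    + lam * F.l1_loss(codes,
                                      model.content_codes(mel_postnet, emb)))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
    return model
```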