Improving Zero-Shot Voice Style Transfer via Disentangled Representation Learning
Authors: Siyang Yuan, Pengyu Cheng, Ruiyi Zhang, Weituo Hao, Zhe Gan, Lawrence Carin
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the real-world VCTK dataset, our method outperforms other baselines and obtains state-of-the-art results in transfer accuracy and voice naturalness for voice style transfer under both many-to-many and zero-shot setups, as measured by two objective and two subjective metrics. |
| Researcher Affiliation | Collaboration | Siyang Yuan (1), Pengyu Cheng (1), Ruiyi Zhang (1), Weituo Hao (1), Zhe Gan (2), and Lawrence Carin (1). (1) Duke University, Durham, North Carolina, USA; (2) Microsoft, Redmond, Washington, USA |
| Pseudocode | No | The paper provides mathematical formulations and descriptions of the model architecture, but it does not include pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper references official implementations of baseline models (e.g., "For the Blow model, we use the official implementation available on GitHub (https://github.com/joansj/blow).") but does not provide a link or explicit statement about releasing the source code for its own proposed method (IDE-VC). |
| Open Datasets | Yes | The selected dataset is CSTR Voice Cloning Toolkit (VCTK) (Yamagishi et al., 2019), which includes 46 hours of audio from 109 speakers. |
| Dataset Splits | Yes | For the many-to-many VST task, we randomly select 10% of the sentences for validation and 10% of the sentences for testing from the VCTK dataset, following the setting in Blow (Serrà et al., 2019). The rest of the data are used for training in a non-parallel scheme. (A minimal split sketch appears below the table.) |
| Hardware Specification | Yes | Our model is implemented with PyTorch and takes 1 GPU day on an Nvidia Xp to train. |
| Software Dependencies | No | The paper mentions software such as PyTorch, the WaveNet vocoder, and Resemblyzer, but does not specify their version numbers, which is required for reproducibility. |
| Experiment Setup | Yes | The number of mel-frequency bins is set to 80. The speaker encoder consists of a 2-layer long short-term memory (LSTM) network with cell size 768 and a fully-connected layer with output dimension 256; it is initialized with weights from a pretrained GE2E (Wan et al., 2018) encoder. The content encoder consists of three convolutional layers with 512 channels and a two-layer bidirectional LSTM with cell dimension 32. Following the setup in AUTOVC (Qian et al., 2019), the forward and backward outputs of the bidirectional LSTM are downsampled by 16. The initial decoder consists of a three-layer convolutional neural network (CNN) with 512 channels, three LSTM layers with cell dimension 1024, and another convolutional layer that projects the LSTM output to dimension 80. To enhance the quality of the spectrogram, again following AUTOVC (Qian et al., 2019), a post-network is used, consisting of five convolutional layers with 512 channels for the first four layers and 80 channels for the last. The reconstruction loss is applied to both the output of the initial decoder and the final conversion signal. For implementation, we parameterize the variational distribution in the Gaussian family $q_\theta(s\mid c) = \mathcal{N}(\mu_\theta(c), \sigma_\theta^2(c)\, I)$, where the mean $\mu_\theta(\cdot)$ and variance $\sigma_\theta^2(\cdot)$ are two-layer fully-connected networks with $\tanh(\cdot)$ as the activation function. |
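
Read together, the dimensions quoted in the Experiment Setup row map onto a small set of PyTorch modules. The sketch below is a hypothetical reconstruction, not the authors' released code: layer widths follow the row above, while the kernel sizes, padding, batch normalization, the residual connection in the post-network, the frame-wise concatenation of the speaker embedding in the decoder, and the hidden width `dim_hidden` of the posterior MLPs are AUTOVC-style assumptions.

```python
# Hypothetical PyTorch sketch of the modules described in the Experiment Setup
# row. Layer sizes follow the paper; kernel sizes (5, padding 2), batch norm,
# and the residual post-net connection are assumptions in the AUTOVC style.
import torch
import torch.nn as nn


class SpeakerEncoder(nn.Module):
    """2-layer LSTM (cell size 768) + FC projection to a 256-d style embedding."""
    def __init__(self, n_mels=80, dim_cell=768, dim_emb=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, dim_cell, num_layers=2, batch_first=True)
        self.proj = nn.Linear(dim_cell, dim_emb)

    def forward(self, mels):                      # mels: (B, T, 80)
        _, (h, _) = self.lstm(mels)
        return self.proj(h[-1])                   # (B, 256)


class ContentEncoder(nn.Module):
    """3 conv layers (512 ch) + 2-layer bidirectional LSTM (cell 32), downsampled by 16."""
    def __init__(self, n_mels=80, conv_ch=512, dim_cell=32, downsample=16):
        super().__init__()
        self.convs = nn.Sequential(
            *[nn.Sequential(
                nn.Conv1d(n_mels if i == 0 else conv_ch, conv_ch, 5, padding=2),
                nn.BatchNorm1d(conv_ch), nn.ReLU())
              for i in range(3)])
        self.lstm = nn.LSTM(conv_ch, dim_cell, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.downsample = downsample

    def forward(self, mels):                      # mels: (B, T, 80)
        x = self.convs(mels.transpose(1, 2)).transpose(1, 2)
        codes, _ = self.lstm(x)                   # (B, T, 64)
        return codes[:, ::self.downsample]        # keep every 16th frame


class Decoder(nn.Module):
    """3 conv layers (512 ch) + 3-layer LSTM (1024) + projection to 80 mel bins."""
    def __init__(self, dim_content=64, dim_spk=256, conv_ch=512, dim_cell=1024, n_mels=80):
        super().__init__()
        d_in = dim_content + dim_spk
        self.convs = nn.Sequential(
            *[nn.Sequential(
                nn.Conv1d(d_in if i == 0 else conv_ch, conv_ch, 5, padding=2),
                nn.BatchNorm1d(conv_ch), nn.ReLU())
              for i in range(3)])
        self.lstm = nn.LSTM(conv_ch, dim_cell, num_layers=3, batch_first=True)
        self.proj = nn.Linear(dim_cell, n_mels)

    def forward(self, content, spk_emb):          # content: (B, T, 64), spk_emb: (B, 256)
        spk = spk_emb.unsqueeze(1).expand(-1, content.size(1), -1)
        x = torch.cat([content, spk], dim=-1)
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.lstm(x)
        return self.proj(x)                       # initial (pre-post-net) mel output


class PostNet(nn.Module):
    """5 conv layers: 512 channels for the first four, 80 for the last."""
    def __init__(self, n_mels=80, conv_ch=512):
        super().__init__()
        chans = [n_mels] + [conv_ch] * 4 + [n_mels]
        layers = []
        for i in range(5):
            layers.append(nn.Conv1d(chans[i], chans[i + 1], 5, padding=2))
            if i < 4:
                layers += [nn.BatchNorm1d(chans[i + 1]), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, mels):                      # residual refinement (assumed)
        return mels + self.net(mels.transpose(1, 2)).transpose(1, 2)


class VariationalPosterior(nn.Module):
    """q_theta(s|c) = N(mu_theta(c), sigma_theta^2(c) I); two-layer tanh MLPs."""
    def __init__(self, dim_content=64, dim_spk=256, dim_hidden=256):
        super().__init__()                        # dim_hidden is an assumption
        self.mu = nn.Sequential(nn.Linear(dim_content, dim_hidden), nn.Tanh(),
                                nn.Linear(dim_hidden, dim_spk))
        self.logvar = nn.Sequential(nn.Linear(dim_content, dim_hidden), nn.Tanh(),
                                    nn.Linear(dim_hidden, dim_spk))

    def forward(self, content):
        return self.mu(content), self.logvar(content)
```

At conversion time, the content code of the source utterance would be combined with the 256-d embedding of the target speaker before decoding and post-net refinement; the WaveNet vocoder that maps mel-spectrograms back to waveforms is not sketched here.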
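
The Dataset Splits row describes a sentence-level 80/10/10 partition of VCTK for the many-to-many task. A minimal sketch, assuming a standard `VCTK-Corpus/wav48` directory layout and an arbitrary fixed seed (neither is specified in the paper):

```python
# Hypothetical sketch of the many-to-many split quoted in the Dataset Splits
# row: 10% of sentences for validation, 10% for testing, the rest for
# non-parallel training. Directory layout and seed are assumptions.
import random
from pathlib import Path

def split_vctk(root="VCTK-Corpus/wav48", val_frac=0.1, test_frac=0.1, seed=0):
    """Return train/val/test lists of utterance paths, split by sentence."""
    utterances = sorted(Path(root).rglob("*.wav"))
    random.Random(seed).shuffle(utterances)
    n_val = int(len(utterances) * val_frac)
    n_test = int(len(utterances) * test_frac)
    val, test = utterances[:n_val], utterances[n_val:n_val + n_test]
    train = utterances[n_val + n_test:]           # non-parallel training pool
    return train, val, test
```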