VoiceMixer: Adversarial Voice Style Mixup

Authors: Sang-Hoon Lee, Ji-Hoon Kim, Hyunseung Chung, Seong-Whan Lee

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated our model with the VCTK dataset, which has 46 hours of audio from 109 speakers (Veaux et al., 2017). ... We conduct the naturalness and similarity mean opinion score test. ... We conduct three objective metrics; the equal error rate of the automatic speaker verification (ASV EER), the mel-cepstral distance (MCD13) (Kubichek, 1993), and the F0 root mean square error (RMSEf0). ... We conducted ablation studies for the information bottleneck and adversarial feedback in Table 3. (A hedged sketch of the MCD13 and RMSEf0 metrics follows the table.)
Researcher Affiliation | Academia | Sang-Hoon Lee (1), Ji-Hoon Kim (2), Hyunseung Chung (2), Seong-Whan Lee (2); {sh_lee, jihoon_kim, hs_chung, sw.lee}@korea.ac.kr; (1) Department of Brain and Cognitive Engineering, Korea University, Seoul, Korea; (2) Department of Artificial Intelligence, Korea University, Seoul, Korea
Pseudocode | No | The paper describes the architecture and methods in text and diagrams, but does not include explicit pseudocode or algorithm blocks.
Open Source Code | Yes | We attach the training and inference code at the supplemental material.
Open Datasets | Yes | We evaluated our model with the VCTK dataset, which has 46 hours of audio from 109 speakers (Veaux et al., 2017).
Dataset Splits | Yes | We divided the dataset into 98 speakers as base speakers for many-to-many VST and 10 speakers as the novel speakers for zero-shot VST. The base speakers are split into train and test sets. For the non-parallel dataset setting, the training set consists of different utterances for all of the speakers, and the test set consists of 25 same utterances. (See the split sketch after the table.)
Hardware Specification | No | No specific hardware details (like GPU/CPU models or cloud instances) are provided in the main text of the paper.
Software Dependencies | No | The spectrogram is inverted to a waveform by the pre-trained HiFi-GAN vocoder (Kong et al., 2020). The Google Speech-to-Text API is used as the ASR model. No software version numbers are given.
Experiment Setup | Yes | The generator consists of a speaker encoder, content encoder, similarity-based information bottleneck, and decoder. We train the entire model jointly. The speaker embedding is extracted from the speaker encoder, which has the same architecture as the reference encoder in (Skerry-Ryan et al., 2018). The source speech is fed to the content encoder consisting of a pre-net and three blocks of the multi-receptive field fusion (MRF) (Kong et al., 2020). The pre-net is two linear layers with 384 channels. ... We use the combination of two dilations of [1, 3] and two receptive fields of [3, 7] for the MRF. ... For the contrastive encoder, we use three masked convolution blocks of (Liu et al., 2020) with 384 channels, a receptive field size of 23, and mask sizes of [5, 7, 9]. We set k to 24 (about 0.3 s)... The content discriminator consists of four blocks, each with a speech-side and a content-condition-side block following (Lee et al., 2021). Each block has two 1D convolutional layers. The hidden representation of the condition-side block is added to the speech-side hidden representations of [256, 512, 1024, 1024]. (A sketch of the pre-net and an MRF block follows the table.)
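
As a concrete reference for the objective metrics quoted in the Research Type row, here is a minimal Python sketch of MCD13 and RMSEf0, assuming the mel-cepstra and F0 contours have already been extracted and time-aligned; the function names and input conventions are ours, and only the metric definitions (Kubichek, 1993 for MCD) come from the paper's references.

```python
import numpy as np

def mcd13(mc_ref: np.ndarray, mc_conv: np.ndarray) -> float:
    """Mel-cepstral distance over 13 coefficients (Kubichek, 1993),
    skipping the 0th (energy) coefficient. Inputs are assumed to be
    time-aligned arrays of shape (frames, >= 14)."""
    diff = mc_ref[:, 1:14] - mc_conv[:, 1:14]
    # Per-frame distance: (10 / ln 10) * sqrt(2 * sum of squared differences),
    # averaged over all frames.
    return float(np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))

def rmse_f0(f0_ref: np.ndarray, f0_conv: np.ndarray) -> float:
    """Root mean square error of F0 (in Hz), restricted to frames that are
    voiced (F0 > 0) in both utterances."""
    voiced = (f0_ref > 0) & (f0_conv > 0)
    return float(np.sqrt(np.mean((f0_ref[voiced] - f0_conv[voiced]) ** 2)))
```

ASV EER, the third metric, additionally requires a pre-trained speaker verification model, so it is not sketched here.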
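The Dataset Splits row can likewise be made concrete. The sketch below partitions the VCTK speaker list into 98 base speakers (for many-to-many VST, later split into train/test utterances) and 10 novel speakers held out for zero-shot VST; the helper name and the fixed seed are illustrative assumptions, not the paper's protocol.

```python
import random

def split_vctk_speakers(all_speakers, n_base=98, n_novel=10, seed=0):
    """Split VCTK speaker IDs into base speakers (seen during training,
    used for many-to-many VST) and novel speakers (fully held out,
    used for zero-shot VST). The paper uses 98 + 10 of the 109 speakers."""
    assert len(all_speakers) >= n_base + n_novel
    speakers = sorted(all_speakers)
    random.Random(seed).shuffle(speakers)
    novel = speakers[:n_novel]                 # unseen speakers, zero-shot VST
    base = speakers[n_novel:n_novel + n_base]  # base speakers, many-to-many VST
    return base, novel
```

The base speakers' utterances are then divided into a training set of non-parallel (different) utterances and a test set of 25 utterances shared across speakers, as stated in the quoted text.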
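Finally, for the Experiment Setup row, the following PyTorch sketch assembles the described content-encoder pre-net (two linear layers with 384 channels) and one multi-receptive field fusion (MRF) block with receptive fields [3, 7] and dilations [1, 3]. It follows the MRF design of HiFi-GAN (Kong et al., 2020), which the paper cites, but the class names, activation choices, and residual wiring are our assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class PreNet(nn.Module):
    """Content-encoder pre-net: two linear layers with 384 channels,
    applied frame-wise to the input mel-spectrogram (80 bins assumed)."""
    def __init__(self, in_dim: int = 80, hidden: int = 384):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, in_dim) -> (batch, frames, hidden)
        return self.net(x)

class MRFBlock(nn.Module):
    """One multi-receptive field fusion block (HiFi-GAN style): parallel
    dilated 1D convolution branches whose outputs are averaged and added
    back to the input as a residual."""
    def __init__(self, channels: int = 384,
                 kernel_sizes=(3, 7), dilations=(1, 3)):
        super().__init__()
        self.branches = nn.ModuleList()
        for k in kernel_sizes:
            layers = []
            for d in dilations:
                layers += [
                    nn.LeakyReLU(0.1),
                    nn.Conv1d(channels, channels, kernel_size=k,
                              dilation=d, padding=(k - 1) * d // 2),
                ]
            self.branches.append(nn.Sequential(*layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); sequence length is preserved by padding.
        return x + sum(branch(x) for branch in self.branches) / len(self.branches)
```

The full content encoder in the paper stacks the pre-net with three such MRF blocks; the similarity-based information bottleneck, contrastive encoder, and content discriminator are not sketched here.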