Phoneme Hallucinator: One-Shot Voice Conversion via Set Expansion

Authors: Siyuan Shan, Yang Li, Amartya Banerjee, Junier B. Oliva

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Quantitative and qualitative evaluations show that Phoneme Hallucinator outperforms existing VC methods for both intelligibility and speaker similarity. In this section, we attempt to answer the following research questions: 1. Can our model outperform existing methods w.r.t. both intelligibility and speaker similarity when the target speech is short?"
Researcher Affiliation | Academia | "Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA. {siyuanshan, yangli95, amartya1, joliva}@cs.unc.edu"
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Codes and audio samples are available at https://phonemehallucinator.github.io/."
Open Datasets | Yes | "To answer these questions, we train our model on LibriSpeech train-clean-100 split and evaluate on LibriSpeech test-clean split, which contains 40 speakers unseen during training and each speaker contains approximately 8 minutes of 16 kHz English speech (Panayotov et al. 2015)." (See the data-loading sketch after this table.)
Dataset Splits | Yes | "LibriSpeech dev-clean split is used as the validation set. To train our model, we need to obtain an extra set of speech representations Xe conditioned on Xt. Equivalently, we can reformulate this task as obtaining a missing subset of speech representations Xe based on an observed subset of speech representations Xt. Both Xe and Xt are from the same speech utterance representation X with a fixed cardinality N, where Xe ∩ Xt = ∅ and X = Xe ∪ Xt. We randomly select a subset of 200 representations to constitute X, which will be divided into the missing subset Xe and the observed subset Xt to train our model as discussed in Sec. 3.3." (See the set-partition sketch after this table.)
Hardware Specification | Yes | "Our Hallucinator is implemented in TensorFlow and trained with the Adam optimizer with a learning rate of 0.0001 and a batch size of 50 for 250 epochs on an NVIDIA RTX 4090 GPU. With an NVIDIA RTX 4090 GPU and an Intel i9-13900F CPU, our hallucinator only takes 1.48 seconds to hallucinate 30,000 new representations, which is approximately equivalent to 10 minutes of speech." (The 10-minute figure is checked in a short sketch after this table.)
Software Dependencies | No | "Our Hallucinator is implemented in TensorFlow." The paper mentions TensorFlow but does not provide specific version numbers for it or other software dependencies.
Experiment Setup | Yes | "Our Hallucinator is implemented in TensorFlow and trained with the Adam optimizer with a learning rate of 0.0001 and a batch size of 50 for 250 epochs on an NVIDIA RTX 4090 GPU. We use the pre-trained WavLM-large encoder (Chen et al. 2022) to extract frame-level speech representations where a single 1024-dimensional vector is produced for every 20 ms of 16 kHz audio. For the kNN regressor, we set k equal to 4. All Set Transformers contain 4 multi-head attention blocks with a hidden size of 256 and 16 inducing points. The sizes of θ and gi are 256. The size of z is 256 and both the decoder and the encoder in (3) are 4-layer MLPs with a hidden size of 512." (A configuration sketch based on these values follows the table.)
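The LibriSpeech splits quoted under "Open Datasets" and "Dataset Splits" are publicly available. As a minimal illustration of fetching them, the sketch below uses torchaudio's dataset helper; this is purely a convenience sketch, not the authors' (TensorFlow-based) pipeline, and the local `root` path is hypothetical.

```python
# Minimal sketch: fetching the public LibriSpeech splits named in the paper.
# torchaudio is used only for illustration; the released implementation uses TensorFlow.
import torchaudio

root = "./data"  # hypothetical local directory

# 100-hour clean training split used to train the hallucinator.
train_set = torchaudio.datasets.LIBRISPEECH(root, url="train-clean-100", download=True)
# Clean test split containing 40 speakers unseen during training.
test_set = torchaudio.datasets.LIBRISPEECH(root, url="test-clean", download=True)
# Clean dev split used as the validation set.
dev_set = torchaudio.datasets.LIBRISPEECH(root, url="dev-clean", download=True)

waveform, sample_rate, transcript, speaker_id, chapter_id, utt_id = train_set[0]
print(sample_rate)  # 16000, matching the 16 kHz audio described in the paper
```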
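The "Dataset Splits" evidence describes the training-time formulation: a set X of N = 200 frame-level representations is partitioned into an observed subset Xt and a missing subset Xe with Xe ∩ Xt = ∅ and X = Xe ∪ Xt. Below is a minimal NumPy sketch of such a random partition; the uniform choice of the observed-subset size is an assumption, since the paper defers the exact sampling scheme to its Sec. 3.3.

```python
# Sketch of the training-time set partition described under "Dataset Splits":
# a set X of N = 200 frame-level representations is split into an observed subset
# X_t and a missing subset X_e, disjoint and covering X.
import numpy as np

def split_representation_set(X: np.ndarray, rng: np.random.Generator):
    """Randomly partition the rows of X into (X_t, X_e)."""
    n = X.shape[0]                    # fixed cardinality N (200 in the paper)
    n_observed = rng.integers(1, n)   # assumed: observed-subset size drawn uniformly from 1..N-1
    perm = rng.permutation(n)
    X_t = X[perm[:n_observed]]        # observed subset, conditions the model
    X_e = X[perm[n_observed:]]        # missing subset, to be hallucinated
    return X_t, X_e

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1024))  # 200 WavLM-sized (1024-d) representations
X_t, X_e = split_representation_set(X, rng)
assert X_t.shape[0] + X_e.shape[0] == 200
```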
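The "Hardware Specification" claim that 30,000 hallucinated representations correspond to roughly 10 minutes of speech follows from the 20 ms-per-representation frame rate quoted under "Experiment Setup"; a one-line check:

```python
# Sanity check of the "30,000 representations ≈ 10 minutes" figure:
# WavLM produces one representation per 20 ms of 16 kHz audio.
num_reps = 30_000
seconds = num_reps * 0.020   # 600.0 s
print(seconds / 60)          # 10.0 minutes
```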
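Finally, the hyperparameters quoted under "Experiment Setup" can be gathered into a single configuration. The sketch below records them and illustrates a k = 4 nearest-neighbour regression over WavLM-sized features; the dictionary keys and the use of scikit-learn are illustrative assumptions, not the authors' released code.

```python
# Hyperparameters quoted under "Experiment Setup", gathered for reference,
# plus an illustrative k = 4 nearest-neighbour regression over WavLM-sized features.
from sklearn.neighbors import KNeighborsRegressor
import numpy as np

CONFIG = {
    "feature_dim": 1024,          # WavLM-large: one vector per 20 ms of 16 kHz audio
    "knn_k": 4,                   # k for the kNN regressor
    "set_transformer_blocks": 4,  # multi-head attention blocks
    "set_transformer_hidden": 256,
    "inducing_points": 16,
    "latent_dim_z": 256,
    "mlp_layers": 4,              # encoder/decoder MLPs
    "mlp_hidden": 512,
    "optimizer": "adam",
    "learning_rate": 1e-4,
    "batch_size": 50,
    "epochs": 250,
}

# Illustrative kNN step: replace each source frame with the mean of its
# k = 4 nearest target-speaker frames in representation space.
rng = np.random.default_rng(0)
target_frames = rng.standard_normal((500, CONFIG["feature_dim"]))  # real + hallucinated target reps
source_frames = rng.standard_normal((100, CONFIG["feature_dim"]))  # content frames to convert

knn = KNeighborsRegressor(n_neighbors=CONFIG["knn_k"], weights="uniform")
knn.fit(target_frames, target_frames)   # regress the target set onto itself
converted = knn.predict(source_frames)  # mean of the 4 nearest target frames per source frame
print(converted.shape)                  # (100, 1024)
```

Expanding the target pool with hallucinated representations before this matching step is the point of the method: a larger pool makes the 4-nearest-neighbour averages better cover the source content.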