SelfVC: Voice Conversion With Iterative Refinement using Self Transformations
Authors: Paarth Neekhara, Shehzeen Samarah Hussain, Rafael Valle, Boris Ginsburg, Rishabh Ranjan, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our framework is trained without any text and achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating naturalness, speaker similarity, and intelligibility of synthesized audio. |
| Researcher Affiliation | Collaboration | *Equal contribution. 1NVIDIA, 2UC San Diego. Correspondence to: Paarth Neekhara <pneekhara@nvidia.com>, Shehzeen Hussain <shehzeenh@nvidia.com>. |
| Pseudocode | Yes | Algorithm 1 details our grouping procedure to obtain duration-augmented content embeddings. (A minimal sketch of the grouping idea appears after this table.) |
| Open Source Code | No | The paper refers to an unofficial open-source implementation of a *third-party method* (NANSY) to explain the heuristic transformations used, but it does not provide code for the SelfVC method developed in this paper. The provided link is: 'https://github.com/dhchoi99/NANSY/blob/master/datasets/functional.py'. (A toy sketch of one such transformation appears after this table.) |
| Open Datasets | Yes | The Conformer-SSL model used as the content encoder is pretrained on 56k hours of unlabelled English speech from the Libri-Light (Kahn et al., 2020) corpus sampled at 16 kHz. We fine-tune the Conformer-SSL model (using self-supervision with contrastive and MLM loss) on the train-clean-360 subset of the LibriTTS (Zen et al., 2019) dataset with audio sampled at 22050 Hz... For our primary experiments, the mel-spectrogram synthesizer and the HiFi-GAN vocoder are also trained on the train-clean-360 subset of the LibriTTS dataset, which contains 360 hours of speech from 904 speakers. |
| Dataset Splits | No | The paper specifies datasets used for training (LibriTTS train-clean-360) and testing (LibriTTS test-clean, VCTK, CSS10), but it does not provide specific details about a dedicated validation split, such as its percentage or sample count, for reproducing the experiments. |
| Hardware Specification | Yes | Fine-tuning takes around 50 hours on a single NVIDIA RTX A6000 GPU... The training time for the Synth (SelfTransform) model is around 5 days on 4 NVIDIA RTX A6000 GPUs. |
| Software Dependencies | No | The paper mentions several models and optimizers, such as 'Conformer-SSL', 'TitaNet', 'FastPitch', 'HiFi-GAN vocoder', and the 'AdamW optimizer', along with their respective citations. However, it does not specify concrete version numbers for any of these software components or underlying libraries (e.g., PyTorch or TensorFlow versions) that would be needed for replication. |
| Experiment Setup | Yes | We fine-tune the Conformer-SSL... for 50 epochs with a batch size of 32 using the AdamW optimizer with a fixed learning rate of 5e-5 and β1 = 0.9, β2 = 0.99... All three variants of the synthesizer... are optimized using an AdamW optimizer (Loshchilov & Hutter, 2019) with a fixed learning rate of 1e-4 and β1 = 0.8, β2 = 0.99 for 500 epochs with a batch size of 32. The threshold τ for duration extraction is set as 0.925. The loss coefficients for the duration and pitch loss are set as λ1 = λ2 = 0.1. (These values are mapped onto an optimizer configuration in the last sketch below.) |
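
The pseudocode row references Algorithm 1, which is not reproduced in the paper excerpt above. The following is a minimal sketch of the grouping idea, not the authors' code: it assumes grouping merges consecutive frame-level content embeddings whose cosine similarity to the current group's running mean exceeds the threshold τ = 0.925 quoted in the experiment-setup row, replacing each group with its mean and keeping the group length as a duration target. The function name and tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def group_content_embeddings(embs: torch.Tensor, tau: float = 0.925):
    """Collapse runs of similar consecutive content embeddings.

    Sketch of the grouping idea (assumed, not the authors' Algorithm 1):
    a frame is folded into the current group while its cosine similarity
    to the group's running mean stays above `tau`; each group is replaced
    by its mean, and the group length becomes a duration target.

    embs: (T, D) frame-level content embeddings.
    Returns (grouped_embs: (G, D), durations: (G,)).
    """
    groups, durations = [embs[0]], [1]
    for t in range(1, embs.shape[0]):
        sim = F.cosine_similarity(embs[t], groups[-1], dim=0)
        if sim >= tau:
            # Fold the frame into the current group's running mean.
            n = durations[-1]
            groups[-1] = (groups[-1] * n + embs[t]) / (n + 1)
            durations[-1] = n + 1
        else:
            # Similarity dropped below tau: start a new group.
            groups.append(embs[t])
            durations.append(1)
    return torch.stack(groups), torch.tensor(durations)
```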
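For the heuristic self-transformations, the paper points to the third-party NANSY repository linked in the open-source-code row rather than releasing its own code. The toy stand-in below only conveys the flavor of such a transformation, using a random pitch perturbation via librosa; the actual NANSY transforms (formant shifting, pitch randomization, parametric equalization) are more involved, and the function name and perturbation range here are hypothetical.

```python
import numpy as np
import librosa

def random_pitch_perturb(wav: np.ndarray, sr: int, max_steps: float = 4.0) -> np.ndarray:
    """Toy stand-in for one NANSY-style heuristic transformation.

    Randomly shifts the pitch of an utterance by up to `max_steps`
    semitones in either direction, perturbing speaker characteristics
    while leaving the linguistic content intact. This is a
    simplification; the linked repository implements richer transforms.
    """
    n_steps = np.random.uniform(-max_steps, max_steps)
    return librosa.effects.pitch_shift(wav, sr=sr, n_steps=n_steps)
```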
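Finally, the hyperparameters in the experiment-setup row translate directly into optimizer configurations. A minimal sketch assuming PyTorch (the paper does not name its training framework, and the placeholder modules below stand in for the real Conformer-SSL encoder and mel-spectrogram synthesizer):

```python
import torch
import torch.nn as nn

# Placeholders for the real models, which are not reproduced here.
encoder = nn.Linear(80, 256)
synthesizer = nn.Linear(256, 80)

# Content-encoder fine-tuning, as reported: AdamW, lr 5e-5,
# betas (0.9, 0.99), 50 epochs, batch size 32.
encoder_opt = torch.optim.AdamW(encoder.parameters(), lr=5e-5, betas=(0.9, 0.99))

# Synthesizer training, as reported: AdamW, lr 1e-4, betas (0.8, 0.99),
# 500 epochs, batch size 32; total loss weights the duration and pitch
# losses by λ1 = λ2 = 0.1 alongside the reconstruction loss.
synth_opt = torch.optim.AdamW(synthesizer.parameters(), lr=1e-4, betas=(0.8, 0.99))
```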