SelfVC: Voice Conversion With Iterative Refinement using Self Transformations
Authors: Paarth Neekhara, Shehzeen Samarah Hussain, Rafael Valle, Boris Ginsburg, Rishabh Ranjan, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our framework is trained without any text and achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating naturalness, speaker similarity, and intelligibility of synthesized audio. |
| Researcher Affiliation | Collaboration | *Equal contribution. 1NVIDIA, 2UC San Diego. Correspondence to: Paarth Neekhara <pneekhara@nvidia.com>, Shehzeen Hussain <shehzeenh@nvidia.com>. |
| Pseudocode | Yes | Algorithm 1 details our grouping procedure to obtain duration-augmented content embeddings. (A minimal sketch of the grouping idea appears after this table.) |
| Open Source Code | No | The paper refers to an unofficial open-source implementation of a *third-party method* (NANSY) to explain the heuristic transformations used, but it does not provide code for the SelfVC method developed in this paper. The provided link is: 'https://github.com/dhchoi99/NANSY/blob/master/datasets/functional.py'. (A toy sketch of one such transformation appears after this table.) |
| Open Datasets | Yes | The Conformer-SSL model used as the content encoder is pretrained on 56k hours of unlabelled English speech from the Libri-Light (Kahn et al., 2020) corpus sampled at 16 kHz. We fine-tune the Conformer-SSL model (using self-supervision with contrastive and MLM loss) on the train-clean-360 subset of the LibriTTS (Zen et al., 2019) dataset with audio sampled at 22050 Hz... For our primary experiments, the mel-spectrogram synthesizer and the HiFi-GAN vocoder are also trained on the train-clean-360 subset of the LibriTTS dataset, which contains 360 hours of speech from 904 speakers. |
| Dataset Splits | No | The paper specifies datasets used for training (LibriTTS train-clean-360) and testing (LibriTTS test-clean, VCTK, CSS10), but it does not provide specific details about a dedicated validation split, such as its percentage or sample count, for reproducing the experiments. |
| Hardware Specification | Yes | Fine-tuning takes around 50 hours on a single NVIDIA RTX A6000 GPU... The training time for the Synth (SelfTransform) model is around 5 days on 4 NVIDIA RTX A6000 GPUs. |
| Software Dependencies | No | The paper mentions several models and optimizers, such as 'Conformer-SSL', 'TitaNet', 'FastPitch', 'HiFi-GAN vocoder', and the 'AdamW optimizer', along with their respective citations. However, it does not specify concrete version numbers for any of these software components or underlying libraries (e.g., PyTorch or TensorFlow versions) that would be needed for replication. |
| Experiment Setup | Yes | We fine-tune the Conformer-SSL... for 50 epochs with a batch size of 32 using the AdamW optimizer with a fixed learning rate of 5e-5 and β1 = 0.9, β2 = 0.99... All three variants of the synthesizer... are optimized using an AdamW optimizer (Loshchilov & Hutter, 2019) with a fixed learning rate of 1e-4 and β1 = 0.8, β2 = 0.99 for 500 epochs with a batch size of 32. The threshold τ for duration extraction is set as 0.925. The loss coefficients for the duration and pitch loss are set as λ1 = λ2 = 0.1. (These values are mapped onto an optimizer configuration in the last sketch below.) |
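
The pseudocode row references Algorithm 1, which is not reproduced in the paper excerpt above. The following is a minimal sketch of the grouping idea, not the authors' code: it assumes grouping merges consecutive frame-level content embeddings whose cosine similarity to the current group's running mean exceeds the threshold τ = 0.925 quoted in the experiment-setup row, replacing each group with its mean and keeping the group length as a duration target. The function name and tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def group_content_embeddings(embs: torch.Tensor, tau: float = 0.925):
    """Collapse runs of similar consecutive content embeddings.

    Sketch of the grouping idea (assumed, not the authors' Algorithm 1):
    a frame is folded into the current group while its cosine similarity
    to the group's running mean stays above `tau`; each group is replaced
    by its mean, and the group length becomes a duration target.

    embs: (T, D) frame-level content embeddings.
    Returns (grouped_embs: (G, D), durations: (G,)).
    """
    groups, durations = [embs[0]], [1]
    for t in range(1, embs.shape[0]):
        sim = F.cosine_similarity(embs[t], groups[-1], dim=0)
        if sim >= tau:
            # Fold the frame into the current group's running mean.
            n = durations[-1]
            groups[-1] = (groups[-1] * n + embs[t]) / (n + 1)
            durations[-1] = n + 1
        else:
            # Similarity dropped below tau: start a new group.
            groups.append(embs[t])
            durations.append(1)
    return torch.stack(groups), torch.tensor(durations)
```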
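For the heuristic self-transformations, the paper points to the third-party NANSY repository linked in the open-source-code row rather than releasing its own code. The toy stand-in below only conveys the flavor of such a transformation, using a random pitch perturbation via librosa; the actual NANSY transforms (formant shifting, pitch randomization, parametric equalization) are more involved, and the function name and perturbation range here are hypothetical.

```python
import numpy as np
import librosa

def random_pitch_perturb(wav: np.ndarray, sr: int, max_steps: float = 4.0) -> np.ndarray:
    """Toy stand-in for one NANSY-style heuristic transformation.

    Randomly shifts the pitch of an utterance by up to `max_steps`
    semitones in either direction, perturbing speaker characteristics
    while leaving the linguistic content intact. This is a
    simplification; the linked repository implements richer transforms.
    """
    n_steps = np.random.uniform(-max_steps, max_steps)
    return librosa.effects.pitch_shift(wav, sr=sr, n_steps=n_steps)
```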
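Finally, the hyperparameters in the experiment-setup row translate directly into optimizer configurations. A minimal sketch assuming PyTorch (the paper does not name its training framework, and the placeholder modules below stand in for the real Conformer-SSL encoder and mel-spectrogram synthesizer):

```python
import torch
import torch.nn as nn

# Placeholders for the real models, which are not reproduced here.
encoder = nn.Linear(80, 256)
synthesizer = nn.Linear(256, 80)

# Content-encoder fine-tuning, as reported: AdamW, lr 5e-5,
# betas (0.9, 0.99), 50 epochs, batch size 32.
encoder_opt = torch.optim.AdamW(encoder.parameters(), lr=5e-5, betas=(0.9, 0.99))

# Synthesizer training, as reported: AdamW, lr 1e-4, betas (0.8, 0.99),
# 500 epochs, batch size 32; total loss weights the duration and pitch
# losses by λ1 = λ2 = 0.1 alongside the reconstruction loss.
synth_opt = torch.optim.AdamW(synthesizer.parameters(), lr=1e-4, betas=(0.8, 0.99))
```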