Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
StableVC: Style Controllable Zero-Shot Voice Conversion with Conditional Flow Matching
Authors: Jixun Yao, Yuguang Yang, Yu Pan, Ziqian Ning, Jianhao Ye, Hongbin Zhou, Lei Xie
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that our proposed StableVC outperforms state-of-the-art baseline systems in zero-shot VC and achieves flexible control over timbre and style from different unseen speakers. Moreover, StableVC offers approximately 25× and 1.65× faster sampling compared to autoregressive and diffusion-based baselines. ... We conduct a comparative analysis of the performance in zero-shot voice conversion between our proposed StableVC approach and several baseline systems... The evaluation results of both subjective and objective metrics are shown in Table 1. |
| Researcher Affiliation | Collaboration | Jixun Yao1, Yuguang Yang2, Yu Pan2, Ziqian Ning1, Jianhao Ye2, Hongbin Zhou2, Lei Xie1* 1 Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University, China 2 Ximalaya Inc, China EMAIL, EMAIL |
| Pseudocode | No | The paper describes the model architecture and components in detail and includes figures (Figure 2: The overall framework of StableVC, Figure 3: Details of Dual AGC in the DiT block) to illustrate the process, but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Audio samples can be found in demo pages 1. 1https://yaoxunji.github.io/stablevc/. This link refers to demo pages for audio samples, not an explicit code repository for the methodology described in the paper. The paper does not contain a statement about releasing code or a direct link to a code repository. |
| Open Datasets | Yes | We conduct our experiments on the Libri-Light dataset (Kahn et al. 2020)... For evaluation, we use the VCTK corpus (Veaux et al. 2019) and ESD corpus (Zhou et al. 2022)... |
| Dataset Splits | No | For model training, we use samples longer than 5 seconds and filter out low-quality samples using DNSMOS P.808 scores (Reddy, Gopal, and Cutler 2022), resulting in a 20k hours subset. For evaluation, we use the VCTK corpus (Veaux et al. 2019) and ESD corpus (Zhou et al. 2022), ensuring no speaker or style overlap with the training data. The paper describes data preparation and the datasets used for training and evaluation, but it does not specify explicit training/validation/test splits (e.g., 80/10/10 percentages or sample counts) for any single dataset. |
| Hardware Specification | Yes | We train StableVC on 8 NVIDIA 3090 GPUs for 800K iterations... During inference, we sample the target mel-spectrograms... we calculate the real-time factor (RTF) on a single NVIDIA 3090 GPU... |
| Software Dependencies | No | The mel-spectrograms are reconstructed to waveform by HiFi-GAN vocoder (Kong, Kim, and Bae 2020). We use a pre-trained WavLM model2 and apply K-means clustering... we extract a global speaker embedding from a pre-trained speaker verification model... we use a pretrained CTC-based ASR model6... The paper mentions specific tools and models used (e.g., HiFi-GAN, WavLM, CTC-based ASR model) but does not provide specific version numbers for these software components or any other libraries/frameworks. |
| Experiment Setup | Yes | We train StableVC on 8 NVIDIA 3090 GPUs for 800K iterations with a total batch size of 128 and the AdamW optimizer is used with a learning rate of 0.0001. During inference, we sample the target mel-spectrograms using 10 Euler steps in the flow matching module with a guidance scale of 1. ...where λ is the hyper-parameter used to balance the loss term and we set λ = 0.1. |
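The inference procedure quoted above (10 Euler steps in the flow matching module with a guidance scale) can be sketched generically. This is not the paper's code: `vector_field` is a hypothetical stand-in for the learned flow-matching network, the toy field and mel-spectrogram shapes are illustrative, and the classifier-free-guidance blend is one common formulation, assumed rather than confirmed by the paper.

```python
import numpy as np

def euler_sample(vector_field, x0, num_steps=10, guidance_scale=1.0):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler.

    `vector_field(x, t, cond)` stands in for a learned flow-matching
    network; when guidance_scale != 1, conditional and unconditional
    predictions are blended (classifier-free guidance).
    """
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v_cond = vector_field(x, t, cond=True)
        v_uncond = vector_field(x, t, cond=False)
        v = v_uncond + guidance_scale * (v_cond - v_uncond)
        x = x + dt * v  # one Euler step along the guided flow
    return x

# Toy vector field: drives x linearly toward a fixed "mel" target.
target = np.ones((80, 100))  # 80 mel bins x 100 frames (illustrative)
def toy_field(x, t, cond):
    return (target - x) if cond else np.zeros_like(x)

mel = euler_sample(toy_field, np.zeros((80, 100)),
                   num_steps=10, guidance_scale=1.0)
```

With this toy field each step contracts the gap to the target by a factor of (1 − dt), so fewer Euler steps trade accuracy for speed, which is the source of the sampling-speed advantage the paper reports over diffusion baselines.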