CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

Authors: Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results show that CoVoMix can generate dialogues that are not only human-like in their naturalness and coherence but also involve multiple talkers engaging in multiple rounds of conversation.
Researcher Affiliation | Collaboration | Shanghai Jiao Tong University, China; Microsoft, USA
Pseudocode | Yes | Algorithm 1 illustrates the dialogue data preparation pipeline, introduced in Section 4.1.
Open Source Code | Yes | The data processing script, along with the model training and inference code, is publicly available at https://github.com/vivian556123/NeurIPS2024-CoVoMix.git.
Open Datasets | Yes | The dataset used in this work is the Fisher dataset [20], a telephone-conversation corpus with 2,000 hours of English conversations on various topics.
Dataset Splits | Yes | We randomly divide the Fisher dataset into train/valid/test sets with a 97/1/2 split. (See the split sketch after this table.)
Hardware Specification | Yes | We use 8 NVIDIA Tesla V100 32GB GPUs for training.
Software Dependencies | No | The paper cites the models and tools it builds on (e.g., BERT, HuBERT, HiFi-GAN, WavLM-TDNN) with references, but does not state version numbers for general software dependencies such as Python, PyTorch, or other implementation libraries.
Experiment Setup | Yes | The text-to-semantic model is trained for 10 epochs with batch size 48. The acoustic model and duration model are trained for 100 epochs with batch size 64. We adopt the Adam optimizer with a learning rate of 1e-4. The probability of dropping the condition during training is p_uncond = 0.3, and the strength of guidance is α = 0.7 during inference. (See the guidance sketch after this table.)