CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations
Authors: Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show that CoVoMix can generate dialogues that are not only human-like in their naturalness and coherence but also involve multiple talkers engaging in multiple rounds of conversation. |
| Researcher Affiliation | Collaboration | 1Shanghai Jiao Tong University, China 2Microsoft, USA |
| Pseudocode | Yes | Algorithm 1 illustrates the dialogue data preparation pipeline, introduced in Section 4.1. |
| Open Source Code | Yes | The data processing script, along with the model training and inference code, are publicly available at https://github.com/vivian556123/NeurIPS2024-CoVoMix.git |
| Open Datasets | Yes | The dataset used in this work is the Fisher dataset [20], a telephone conversation dataset with 2,000 hours of English conversations about various topics. |
| Dataset Splits | Yes | We randomly divide the Fisher dataset into train/valid/test sets with 97/1/2 split. |
| Hardware Specification | Yes | We use 8 NVIDIA TESLA V100 32GB GPUs for training. |
| Software Dependencies | No | The paper mentions several models and tools used (e.g., BERT, HuBERT, HiFi-GAN, WavLM-TDNN) and their corresponding references, but does not explicitly state specific version numbers for general software dependencies like Python, PyTorch, or other libraries used for implementation. |
| Experiment Setup | Yes | The text-to-semantic model is trained for 10 epochs with batch size 48. The acoustic model and duration model are trained for 100 epochs with batch size 64. We adopt Adam optimizer with 1e-4 learning rate. The probability of dropping condition during training is puncond = 0.3, and the strength of guidance is α = 0.7 during inference. |
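The reported classifier-free guidance settings (condition dropout probability p_uncond = 0.3 during training, guidance strength α = 0.7 at inference) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names are hypothetical, and the guidance formula shown is one common formulation, (1 + α)·g(x|c) − α·g(x).

```python
import random

P_UNCOND = 0.3  # probability of dropping the condition during training (from the paper)
ALPHA = 0.7     # guidance strength at inference (from the paper)

def maybe_drop_condition(cond, p_uncond=P_UNCOND):
    """During training, replace the condition with None with probability p_uncond,
    so the model also learns an unconditional estimate."""
    return None if random.random() < p_uncond else cond

def guided_estimate(cond_pred, uncond_pred, alpha=ALPHA):
    """Combine conditional and unconditional model outputs at inference time
    using one common classifier-free guidance formulation."""
    return (1 + alpha) * cond_pred - alpha * uncond_pred
```

For example, with a conditional prediction of 1.0 and an unconditional prediction of 0.0, the guided estimate is (1 + 0.7)·1.0 − 0.7·0.0 = 1.7, pushing the output further toward the conditional direction.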