CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

Authors: Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results show that CoVoMix can generate dialogues that are not only human-like in their naturalness and coherence but also involve multiple talkers engaging in multiple rounds of conversation.
Researcher Affiliation | Collaboration | Shanghai Jiao Tong University, China; Microsoft, USA
Pseudocode | Yes | Algorithm 1 illustrates the dialogue data preparation pipeline, introduced in Section 4.1.
Open Source Code | Yes | The data processing script, along with the model training and inference code, is publicly available at https://github.com/vivian556123/NeurIPS2024-CoVoMix.git.
Open Datasets | Yes | The dataset used in this work is the Fisher dataset [20], a telephone-conversation corpus with 2,000 hours of English conversations on various topics.
Dataset Splits | Yes | We randomly divide the Fisher dataset into train/valid/test sets with a 97/1/2 split. (See the split sketch after this table.)
Hardware Specification | Yes | We use 8 NVIDIA Tesla V100 32GB GPUs for training.
Software Dependencies | No | The paper cites the models and tools it builds on (e.g., BERT, HuBERT, HiFi-GAN, WavLM-TDNN) with references, but does not state version numbers for general software dependencies such as Python, PyTorch, or other implementation libraries.
Experiment Setup | Yes | The text-to-semantic model is trained for 10 epochs with batch size 48. The acoustic model and duration model are trained for 100 epochs with batch size 64. We adopt the Adam optimizer with a learning rate of 1e-4. The probability of dropping the condition during training is p_uncond = 0.3, and the strength of guidance is α = 0.7 during inference. (See the guidance sketch after this table.)