Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching
Authors: Leying Zhang, Yao Qian, Xiaofei Wang, Manthan Thakker, Dongmei Wang, Jianwei Yu, Haibin Wu, Yuxuan Hu, Jinyu Li, Yanmin Qian, Sheng Zhao
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that CoVoMix2 outperforms strong baselines, achieving state-of-the-art (SOTA) performance among open-source checkpoints, while requiring significantly less training data and delivering faster inference speed. Section 4: Experimental Setup; Section 5: Result and Analysis |
| Researcher Affiliation | Collaboration | (1) Shanghai Jiao Tong University, China; (2) Microsoft, USA |
| Pseudocode | No | The paper describes the methodology in detail, including design strategies, training processes, and components, but it does not present any explicitly labeled pseudocode or algorithm blocks. The main figure (Figure 1) is an overview diagram, not a pseudocode block. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? answer: [No] Justification: We are not ready to open source the data and code at the current stage. We will consider this plan in the future. |
| Open Datasets | Yes | For the first training stage, we used the LibriHeavy dataset [39], comprising 60k hours of high-quality single-speaker audiobook-style recordings. In the second stage, in addition to the podcast dataset, we simulated dialogue-style data by concatenating utterances from different speakers using both LibriHeavy and LibriTTS [40]. In order to evaluate the model performance, we design a dialogue test set, containing 1000 dialogue transcriptions from Dailydialog [41] and the acoustic prompts are from Librispeech-test-clean [42]. |
| Dataset Splits | Yes | In order to evaluate the model performance, we design a dialogue test set, containing 1000 dialogue transcriptions from Dailydialog [41] and the acoustic prompts are from Librispeech-test-clean [42]. We also use samples from this dialogue dataset for subjective evaluation. |
| Hardware Specification | Yes | Training was conducted on 32 NVIDIA Tesla V100 GPUs (32GB) with gradient accumulation set to 4. During inference, we used a guidance strength α of 1.0 and performed sampling with 32 function evaluations (NFE) using an ODE solver. We measure the RTF on a single NVIDIA A100 machine. |
| Software Dependencies | No | First, we applied Microsoft speech enhancement API to remove background noise and music. Second, we implemented automatic speech recognition and speaker diarization. The diarization results were used to assign speaker identities to each utterance with timestamps. Specifically, we used the Deepgram API [38] to obtain word-level transcriptions and speaker labels. A BigVGAN-based [44] vocoder was employed to convert the log mel-filter bank features into waveforms. We utilize Microsoft Fast Transcription API as automatic speech recognition and diarization tool to transcribe the generated speech. We utilize WavLM-TDNN [47] to extract the speaker embeddings. |
| Experiment Setup | Yes | In our experiments, the backbone architecture closely followed the configurations in [26]. Specifically, we used Transformer with 24 layers, 16 attention heads, and an embedding dimension of 1024 with U-Net [43] style skip connections. The σ_min is set to 0.1. We modeled the 100-dimensional log mel-filter bank features, extracted every 10.7 milliseconds from audio samples with a 24 kHz sampling rate. A BigVGAN-based [44] vocoder was employed to convert the log mel-filter bank features into waveforms. In addition, we implemented Classifier-Free Guidance (CFG) [37] with a dropout probability p_uncond = 20%, randomly removing conditioning during training. In the first training stage, we train the model on 60k hours LibriHeavy [39] dataset for 200k steps with peak learning rate (lr) of 7.5e-5. In the second training stage, we train it for another 200k steps on the combined podcast, audiobook, and simulated dialogue datasets with a peak lr of 5e-5. The model was optimized using the Adam optimizer. A linear-decay learning rate schedule was used in both stages. Each training batch contained two samples, each less than 30 seconds in duration. During inference, we used a guidance strength α of 1.0 and performed sampling with 32 function evaluations (NFE) using an ODE solver. |
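
Since no code is released, the inference settings quoted in the Experiment Setup and Hardware Specification rows (guidance strength α = 1.0, 32 function evaluations with an ODE solver, RTF measurement) can only be illustrated with a rough sketch. The Python below is not the authors' implementation. It assumes a fixed-step Euler solver, a guidance rule of the form (1 + α)·v_cond − α·v_uncond, and that each guided step counts as one function evaluation; the excerpts above confirm none of these. All function names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def cfg_velocity(v_cond: Vector, v_uncond: Vector, alpha: float) -> Vector:
    """Blend conditional and unconditional velocity estimates.

    Assumed guidance rule (not confirmed by the excerpts):
    (1 + alpha) * v_cond - alpha * v_uncond.
    """
    return [(1.0 + alpha) * c - alpha * u for c, u in zip(v_cond, v_uncond)]


def euler_sample(
    velocity_fn: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int = 32,
) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1.

    Uses fixed-step Euler integration; the solver type is an assumption,
    since the source only says "an ODE solver".
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    return synthesis_seconds / audio_seconds


if __name__ == "__main__":
    # Toy stand-ins for the model: constant velocity fields.
    def guided(x: Vector, t: float) -> Vector:
        return cfg_velocity([1.0] * len(x), [0.5] * len(x), alpha=1.0)

    print(euler_sample(guided, [0.0, 0.0], num_steps=32))
    print(real_time_factor(synthesis_seconds=2.0, audio_seconds=10.0))
```

In a real system `velocity_fn` would wrap two forward passes of the flow-matching network (with and without conditioning), so the actual number of network calls per step depends on how the authors count NFE.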