Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
MoonCast: High-Quality Zero-Shot Podcast Generation
Authors: Zeqian Ju, Dongchao Yang, Shen Kai, Yichong Leng, Zhengtao Wang, Songxiang Liu, Xinyu Zhou, Tao Qin, Xiangyang Li, Jianwei Yu, Xu Tan
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 Experiments and Results: In this section, we present an overview of the experimental setup, including detailed descriptions of the data preparation, the model architecture, and the evaluation setting. We report the evaluation results of the Chinese and English podcast generation in Table 1 and 2. We make the following observations: 1) MoonCast consistently surpasses the two concatenation baselines in terms of spontaneity, coherence, intelligibility and quality metrics for both Chinese and English podcast generation. |
| Researcher Affiliation | Collaboration | ¹University of Science and Technology of China, ²Moonshot AI, ³The Chinese University of Hongkong, ⁴Microsoft Research. |
| Pseudocode | No | The paper only describes the methodology and architecture in prose and diagrams (e.g., Figure 1 and Figure 2), without presenting any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We open-source MoonCast, including the prompts² for script generation and the audio modeling module³ for speech generation, to support future research. [...] ³ https://github.com/jzq2000/MoonCast |
| Open Datasets | No | We conduct our experiments on a large-scale internal Chinese and English audio dataset comprising approximately 1.0 million hours of audio from diverse sources, including podcasts, audiobooks, and audio clips from shows. The final dataset comprises 300,000 hours from Chinese audiobook sources, 15,000 hours from Chinese conversational sources, and 200,000 hours from English conversational sources. |
| Dataset Splits | No | We conduct our experiments on a large-scale internal Chinese and English audio dataset comprising approximately 1.0 million hours of audio from diverse sources, including podcasts, audiobooks, and audio clips from shows. The final dataset comprises 300,000 hours from Chinese audiobook sources, 15,000 hours from Chinese conversational sources, and 200,000 hours from English conversational sources. For podcast generation, we curate an evaluation dataset comprising two knowledge sources in PDF format and two in web URL format, encompassing domains such as computer science papers⁴, economics papers⁵, technology blogs⁶, and news articles⁷. For both datasets, we use 3-10 seconds of speech as the prompt for each speaker. The paper specifies the composition of its dataset and what is used for evaluation, but does not provide explicit training, validation, and test splits for the model's training. |
| Hardware Specification | Yes | We train it using the Megatron framework on 64 A100 80GB GPUs with a tensor parallelism degree of 8, over a maximum sequence length of 40k, a batch size of 600, and for 2,000 steps in each curriculum learning stage. |
| Software Dependencies | No | The paper mentions several toolkits and models like 'torchdyn toolkit', 'FunASR', 'NeMo ASR toolkit', 'Pyannotate toolkit', 'Paraformer ASR model', and 'DNSMOS toolkit' but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | We train it using the Megatron framework on 64 A100 80GB GPUs with a tensor parallelism degree of 8, over a maximum sequence length of 40k, a batch size of 600, and for 2,000 steps in each curriculum learning stage. We use a top-k value of 30, a top-p value of 0.8, and a temperature of 0.8 for inference. We use Byte-Pair Encoding (BPE) for text tokenization. [...] For the speech semantic codec, both the encoder and decoder consist of 12 ConvNeXt blocks, each with a kernel size of 7 and a hidden size of 384. The 1024-dimensional SSL feature is projected into an 8-dimensional space for quantization using an 8192-entry codebook. We train the codec for 200,000 steps. [...] For the speech detokenizer, we adopt a 0.8B-parameter, 10-layer DiT-style Transformer with a hidden size of 2048 and 16 attention heads. During training, the chunk size is dynamically set between 0.5 and 3 seconds to support flexible inference. For inference, we specifically use a chunk size of 3 seconds to achieve better quality. The backward ODE for each chunk is solved using 30 steps with the torchdyn toolkit [Poli et al.]. In addition, we adopt a 250M-parameter BigVGAN [Lee et al., 2022] to reconstruct waveforms from mel-spectrograms. |
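
The Experiment Setup row quotes the inference settings (top-k of 30, top-p of 0.8, temperature of 0.8) without showing how they combine. The following is a minimal, standard-library sketch of one common way to apply the three settings to a vector of logits. It is not taken from the MoonCast code. The function name `sample_token`, the order of the filtering steps, and the `rng` parameter are assumptions made for illustration only.

```python
import math
import random


def sample_token(logits, top_k=30, top_p=0.8, temperature=0.8, rng=random):
    """Sample one token id using temperature, then top-k, then top-p filtering.

    Hypothetical helper: defaults mirror the values quoted in the table,
    but the step order is an assumption, not the paper's implementation.
    """
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Top-k: keep the ids of the k highest-scoring tokens, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p.
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for token_id, p in zip(ranked, probs):
        kept_ids.append(token_id)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break

    # random.choices treats the weights as relative, so no renormalization is needed.
    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which is useful when comparing outputs across inference settings.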