Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

StreamFlow: Streaming Audio Generation from Discrete Tokens via Streaming Flow Matching

Authors: Ha-Yeong Choi, Sang-Hoon Lee

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We validate the effectiveness of StreamFlow through audio reconstruction tasks using discrete tokens from EnCodec and Mimi, demonstrating both high-fidelity synthesis and streaming capability. Furthermore, we successfully incorporated our model into fully-duplex streaming speech language models of Moshi by replacing the Mimi decoder. Audio samples are available at https://streamflow25.github.io/demo/. 4 Experiment and Result We compared the performance of EnCodec token reconstruction with the strong parallel baselines including Vocos, Multi-Band Diffusion (MBD), and RFWave. Furthermore, we compare with the streaming baselines including causal EnCodec with the same latency to obtain robust streaming generation. Table 1 demonstrated the effectiveness of our methods in that streaming models still outperformed the powerful parallel models including MBD and RFWave. Furthermore, StreamFlow-T also shows better performance in terms of PESQ. 4.3 Ablation Study In-Context Learning When we prepend the generated samples as prompts, the models could learn useful information from them. Table 2 also showed much better performance than the model without prompt even with the same hyperparameter.
Researcher Affiliation	Collaboration	Ha-Yeong Choi (1), Sang-Hoon Lee (2, 3). (1) Gen AI Lab, KT Corp., Seoul, Korea; (2) Department of Software and Computer Engineering, Ajou University, Suwon, Korea; (3) Department of Artificial Intelligence, Ajou University, Suwon, Korea
Pseudocode	No	The paper describes methods and processes through narrative text, mathematical equations, and architectural diagrams (Figures 3 and 4), but it does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like steps.
Open Source Code	No	We also plan to release the code and model after the paper is accepted. Even though the full code is not yet provided, the detailed architecture is illustrated in figures to enable reproducibility. We will release all source code and checkpoints after paper notification.
Open Datasets	Yes	Dataset We utilized LibriTTS [60] to train all models including the ablation study. LibriTTS is a high-quality speech dataset with a sampling rate of 24,000 Hz. We used EnCodec and Mimi as speech tokenizers to compare the streaming reconstruction performance because they consist of causal convolutional layers for encoding and decoding the waveform signal.
Dataset Splits	Yes	We first validate the models using LibriTTS-dev clean and test subsets, and then we evaluate the performance of each model using universal speech datasets consisting of 300 samples from various datasets including Expresso, HiFiTTS, LibriTTS, Aishell3, JVS, and CML-TTS following RFWave [32].
Hardware Specification	Yes	Training For streaming models, we pre-train StreamFlow models with a learning rate of 2 × 10^-4, batch size of 512 for 1M steps on four NVIDIA A6000 GPUs. We utilize sliced window training by randomly segmenting the waveform signal by 10,240 frames (32 tokens of EnCodec). Then, we fine-tune StreamFlow with a learning rate of 2 × 10^-5, batch size of 64 for 0.25M steps on four NVIDIA A6000 GPUs.
Software Dependencies	No	The paper mentions several software components, including 'PyTorch' (implicitly through 'auraloss: Audio focused loss functions in pytorch' and 'pytorch implementation of CREPE'), 'Auraloss' [47], and 'CREPE' [17]. However, it does not specify explicit version numbers for any of these software dependencies or a programming language version.
Experiment Setup	Yes	Training For streaming models, we pre-train StreamFlow models with a learning rate of 2 × 10^-4, batch size of 512 for 1M steps on four NVIDIA A6000 GPUs. We utilize sliced window training by randomly segmenting the waveform signal by 10,240 frames (32 tokens of EnCodec). Then, we fine-tune StreamFlow with a learning rate of 2 × 10^-5, batch size of 64 for 0.25M steps on four NVIDIA A6000 GPUs. We utilize sliced window training by randomly segmenting the waveform signal by 20,480 frames (64 tokens of EnCodec). The architecture details are described in Appendix A. The parallel models are described in Appendix B. Sampling We utilize the Euler method as the ODE method. We compared the sampling steps for pre-trained models in Table 9. For streaming models, we fine-tuned the model with the fixed-step generator by four steps of parallel models and eight steps of streaming models. For Mimi reconstruction, we fixed two steps for minimal latency.
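
Illustrative sketch (not the authors' code, which the Open Source Code row lists as unreleased): the Experiment Setup excerpt quotes window sizes of 10,240 and 20,480 waveform frames for 32 and 64 EnCodec tokens, which implies 320 samples per token, and a fixed-step Euler ODE sampler with two, four, or eight steps. The Python below shows one way these two pieces could look. The function names `slice_window` and `euler_sample`, the choice to align window starts to token boundaries, and the constant toy velocity field are all assumptions made for illustration; the real model would supply a learned velocity network.

```python
import random

# 10,240 waveform samples correspond to 32 EnCodec tokens in the quoted setup,
# so each token spans 320 samples.
SAMPLES_PER_TOKEN = 10_240 // 32


def slice_window(waveform, num_tokens, rng=random):
    """Randomly crop a window of num_tokens * SAMPLES_PER_TOKEN samples.

    Assumption: the start offset is aligned to a token boundary so that the
    crop lines up with whole tokens. The source does not state this.
    """
    window = num_tokens * SAMPLES_PER_TOKEN
    if len(waveform) < window:
        raise ValueError("waveform shorter than the requested window")
    max_start_token = (len(waveform) - window) // SAMPLES_PER_TOKEN
    start = rng.randint(0, max_start_token) * SAMPLES_PER_TOKEN
    return waveform[start:start + window]


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    one_second = [0.0] * 24_000               # 1 s of silence at 24,000 Hz
    print(len(slice_window(one_second, 32)))  # pre-training window length
    print(len(slice_window(one_second, 64)))  # fine-tuning window length

    # Hypothetical toy velocity field: constant unit velocity in every dimension.
    toy_velocity = lambda x, t: [1.0] * len(x)
    print(euler_sample(toy_velocity, [0.0, 2.0], num_steps=8))
```

Because the toy velocity field is constant, the Euler result does not depend on the step count; with a learned, time-varying field the step count (two, four, or eight in the excerpt) trades latency against integration accuracy.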