Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

StreamFlow: Streaming Audio Generation from Discrete Tokens via Streaming Flow Matching

Authors: Ha-Yeong Choi, Sang-Hoon Lee

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of StreamFlow through audio reconstruction tasks using discrete tokens from EnCodec and Mimi, demonstrating both high-fidelity synthesis and streaming capability. Furthermore, we successfully incorporated our model into fully-duplex streaming speech language models of Moshi by replacing the Mimi decoder. Audio samples are available at https://streamflow25.github.io/demo/. 4 Experiment and Result: We compared the performance of EnCodec token reconstruction with the strong parallel baselines including Vocos, Multi-Band Diffusion (MBD), and RFWave. Furthermore, we compare with the streaming baselines including causal EnCodec with the same latency to obtain robust streaming generation. Table 1 demonstrated the effectiveness of our methods in that streaming models still outperformed the powerful parallel models including MBD and RFWave. Furthermore, StreamFlow-T also shows better performance in terms of PESQ. 4.3 Ablation Study, In-Context Learning: When we prepend the generated samples as prompts, the models could learn useful information from them. Table 2 also showed much better performance than the model without prompt even with the same hyperparameter.
Researcher Affiliation | Collaboration | Ha-Yeong Choi (1), Sang-Hoon Lee (2, 3). (1) Gen AI Lab, KT Corp., Seoul, Korea; (2) Department of Software and Computer Engineering, Ajou University, Suwon, Korea; (3) Department of Artificial Intelligence, Ajou University, Suwon, Korea
Pseudocode | No | The paper describes methods and processes through narrative text, mathematical equations, and architectural diagrams (Figures 3 and 4), but it does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like steps.
Open Source Code | No | We also plan to release the code and model after the paper is accepted. Even though the full code is not yet provided, the detailed architecture is illustrated in figures to enable reproducibility. We will release all source code and checkpoints after paper notification.
Open Datasets | Yes | Dataset: We utilized LibriTTS [60] to train all models including the ablation study. LibriTTS is a high-quality speech dataset with a sampling rate of 24,000 Hz. We used EnCodec and Mimi as speech tokenizers to compare the streaming reconstruction performance because they consist of causal convolutional layers for encoding and decoding the waveform signal.
Dataset Splits | Yes | We first validate the models using LibriTTS-dev clean and test subsets, and then we evaluate the performance of each model using universal speech datasets consisting of 300 samples from various datasets including Expresso, HiFiTTS, LibriTTS, Aishell3, JVS, and CML-TTS following RFWave [32].
Hardware Specification | Yes | Training: For streaming models, we pre-train StreamFlow models with a learning rate of 2 × 10⁻⁴, batch size of 512 for 1M steps on four NVIDIA A6000 GPUs. We utilize sliced window training by randomly segmenting the waveform signal by 10,240 frames (32 tokens of EnCodec). Then, we fine-tune StreamFlow with a learning rate of 2 × 10⁻⁵, batch size of 64 for 0.25M steps on four NVIDIA A6000 GPUs.
Software Dependencies | No | The paper mentions several software components, including 'PyTorch' (implicitly through 'auraloss: Audio focused loss functions in pytorch' and 'pytorch implementation of CREPE'), 'Auraloss' [47], and 'CREPE' [17]. However, it does not specify explicit version numbers for any of these software dependencies or a programming language version.
Experiment Setup | Yes | Training: For streaming models, we pre-train StreamFlow models with a learning rate of 2 × 10⁻⁴, batch size of 512 for 1M steps on four NVIDIA A6000 GPUs. We utilize sliced window training by randomly segmenting the waveform signal by 10,240 frames (32 tokens of EnCodec). Then, we fine-tune StreamFlow with a learning rate of 2 × 10⁻⁵, batch size of 64 for 0.25M steps on four NVIDIA A6000 GPUs. We utilize sliced window training by randomly segmenting the waveform signal by 20,480 frames (64 tokens of EnCodec). The architecture details are described in Appendix A. The parallel models are described in Appendix B. Sampling: We utilize the Euler method as the ODE method. We compared the sampling steps for pre-trained models in Table 9. For streaming models, we fine-tuned the model with the fixed-step generator by four steps of parallel models and eight steps of streaming models. For Mimi reconstruction, we fixed two steps for minimal latency.
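
The experiment-setup excerpt above quotes two mechanics that are easy to misread in prose: sliced-window training (10,240 or 20,480 waveform frames for 32 or 64 EnCodec tokens) and fixed-step Euler sampling. The following is a minimal sketch of both, not the authors' implementation, since the code was not released at the time of the report. All function names are hypothetical, the window alignment to token boundaries is an assumption, the integration direction from t=0 to t=1 is an assumed convention, and `toy_velocity` merely stands in for a trained velocity network.

```python
import random

# Samples per EnCodec token implied by the quoted setup: 10,240 / 32 = 320.
HOP = 10_240 // 32


def window_samples(num_tokens, hop=HOP):
    """Waveform samples covered by a training window of `num_tokens` tokens."""
    return num_tokens * hop


def random_window(total_samples, num_tokens, rng, hop=HOP):
    """Pick a random (start, end) slice of the waveform.

    Aligning the start to a token boundary is an assumption made here;
    the quoted text only says the waveform is segmented randomly.
    """
    length = window_samples(num_tokens, hop)
    if total_samples < length:
        raise ValueError("waveform shorter than the requested window")
    max_start_token = (total_samples - length) // hop
    start = rng.randint(0, max_start_token) * hop
    return start, start + length


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for a trained model: a constant velocity along a straight
# path from `noise` to `target`, for which Euler integration is exact.
noise = [0.0, 0.5, -1.0]
target = [1.0, -0.5, 2.0]


def toy_velocity(x, t):
    return [b - a for a, b in zip(noise, target)]


pretrain_len = window_samples(32)   # 10240 frames, as quoted for pre-training
finetune_len = window_samples(64)   # 20480 frames, as quoted for fine-tuning
start, end = random_window(48_000, 32, random.Random(0))
streamed = euler_sample(toy_velocity, noise, num_steps=8)  # eight steps, streaming setting
```

With a real model, the number of steps trades quality against latency, which is presumably why the excerpt reports four steps for parallel models, eight for streaming models, and two for Mimi reconstruction.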