Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis
Authors: Dong Yang, Yiyi Cai, Yuki Saito, Lixu Wang, Hiroshi Saruwatari
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that SFM yields consistent gains in speech naturalness across both objective and subjective evaluations, and significantly accelerates inference when using adaptive-step ODE solvers. Demo and codes are available at https://ydqmkkx.github.io/SFMDemo/. |
| Researcher Affiliation | Academia | Dong Yang¹*, Yiyi Cai², Yuki Saito¹, Lixu Wang³, Hiroshi Saruwatari¹; ¹The University of Tokyo, ²Independent Researcher, ³Nanyang Technological University |
| Pseudocode | Yes | We provide concise algorithm boxes in Appendix C, using minimal notation for clarity and including implementation details. |
| Open Source Code | Yes | Demo and codes are available at https://ydqmkkx.github.io/SFMDemo/. |
| Open Datasets | Yes | We use LJ Speech [17], VCTK [44], and LibriTTS [50] in our experiments, where LJ Speech is a single-speaker dataset and the others are multi-speaker datasets. |
| Dataset Splits | Yes | For LJ Speech and VCTK, the training, validation, and test sets are divided following the setting of Matcha-TTS, which follows VITS's settings. Each validation set contains 100 utterances, and each test set contains 500 utterances. For LibriTTS, the train subsets are used as the training set. We construct the validation and test sets with the dev-clean and test-clean subsets, respectively. We set the validation and test sets to contain 200 and 1000 utterances, respectively. |
| Hardware Specification | Yes | All training, inference, and objective evaluations are conducted on 96 GB Nvidia H100 GPUs with half precision (FP16). |
| Software Dependencies | No | Although we use publicly available datasets and provide our code in supplemental material, our experiment involves multiple models whose environment settings need to be found on their official GitHub pages. (from Checklist Question 5 justification) |
| Experiment Setup | Yes | All training, inference, and objective evaluations are conducted on 96 GB Nvidia H100 GPUs with half precision (FP16). We follow the official configurations of each baseline as closely as possible, and some key settings are summarized in Table 1. Specifically, a warmup parameter indicates that a learning rate scheduler is used. Because of our large batch sizes, we increase the constant or peak learning rates based on the linear scaling rule, and then reduce them if gradient explosions are observed. The presence of a CFG strength indicates the application of CFG. |
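
The Experiment Setup entry mentions raising learning rates with the linear scaling rule and lowering them again when gradient explosions occur. The following is a minimal sketch of that idea, not the authors' implementation: the function names, the reference batch size, the reference learning rate, and the back-off factor are all hypothetical values chosen for illustration, and the paper's actual settings are those listed in its Table 1.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Linear scaling rule: the learning rate grows in proportion to the batch size."""
    if base_batch_size <= 0 or batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * batch_size / base_batch_size


def back_off(lr: float, factor: float = 0.5) -> float:
    """Reduce the learning rate after an observed gradient explosion.

    The factor of 0.5 is an assumption for this sketch; the source does not
    state how much the rates were reduced.
    """
    if not 0.0 < factor < 1.0:
        raise ValueError("factor must lie strictly between 0 and 1")
    return lr * factor


if __name__ == "__main__":
    # Hypothetical baseline configuration: lr = 1e-4 at batch size 32.
    lr = scale_learning_rate(base_lr=1e-4, base_batch_size=32, batch_size=128)
    print(f"scaled learning rate: {lr:.6f}")

    # If training diverges at the scaled rate, retry with a smaller one.
    lr = back_off(lr)
    print(f"learning rate after back-off: {lr:.6f}")
```

In this sketch the scaled value serves only as a starting point; whether it needs to be reduced is an empirical decision made by watching for diverging gradients during training.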