Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Audio Super-Resolution with Latent Bridge Models

Authors: Chang Li, Zehua Chen, Liyuan Wang, Jun Zhu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experimental results evaluated on the VCTK, ESC-50, Song-Describer benchmark datasets and two internal testsets demonstrate that we achieve state-of-the-art objective and perceptual quality for any-to-48 kHz SR across speech, audio, and music signals, as well as setting the first record for any-to-192 kHz audio SR. ... 4 Experiment: In this section, we first describe the experimental setups and then present the experimental results of any-to-48 kHz upsampling and upsampling beyond 48 kHz with an in-depth analysis.
Researcher Affiliation | Collaboration | (1) Department of CST, Tsinghua University, Beijing, China; (2) Shengshu AI, Beijing, China; (3) USTC, Hefei, China
Pseudocode | No | The paper describes methods using mathematical formulations (e.g., Eq. 1, 2, 3, 5, 8) and textual descriptions, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | Justification: The authors will provide detailed implementation specifics in the supplementary materials, but the code is not available at this stage.
Open Datasets | Yes | Comprehensive experimental results evaluated on the VCTK, ESC-50, Song-Describer benchmark datasets and two internal testsets demonstrate that we achieve state-of-the-art objective and perceptual quality for any-to-48 kHz SR across speech, audio, and music signals, as well as setting the first record for any-to-192 kHz audio SR. ... VCTK [118] test set, 300 audio samples from the ESC-50 fold-5 [79] and 300 music samples from the Song-Describer-Dataset (SDS) [68]. ... Table 15: Overview of training datasets used at different sampling rates, including both public and internal sources. VCTK-train [118] ... OpenSLR [43] ... EARS [86] ... Expresso [75] ... MusDB18 [82] ... Medleydb [6] ... FSD50K [29]
Dataset Splits | Yes | For the any-to-48 kHz task, we randomly sample 500 speech clips from the VCTK [118] test set, 300 audio samples from the ESC-50 fold-5 [79] and 300 music samples from the Song-Describer-Dataset (SDS) [68]. ... To evaluate performance on native 48 kHz content, we additionally use 300 randomly selected clips from our internal 48 kHz dataset (48Audio). For the 96 kHz and 192 kHz settings, we select 300 audio clips (96/192 Audio) and 300 music (96/192 Music) excerpts from our internal dataset. Each 192 kHz clip is 2.56 seconds long, while all other evaluation samples are 5.12 seconds. ... All recordings with an original sampling rate below 32 kHz are filtered out, resulting in a total of approximately 5,000 hours of training data.
Hardware Specification | Yes | We report the real-time factor (RTF) on an NVIDIA-A800 and several baselines under the 48 kHz setting in Table 12.
Software Dependencies | No | The paper mentions several software components like 'librosa [70]' and frameworks like 'DiT [78, 4]-based noise prediction network' and 'Adam optimizer', but it does not specify exact version numbers for these software dependencies or libraries used in their implementation.
Experiment Setup | Yes | Training setup. We train our compression and SR models on a mixed corpus comprising speech, audio, and music data. All recordings with an original sampling rate below 32 kHz are filtered out, resulting in a total of approximately 5,000 hours of training data. The detailed dataset information is summarized in Appendix G. For each SR stage, all data are resampled to the corresponding target sampling rate and randomly cropped for 5.12-second for the any-to-48 kHz and 48→96 kHz models and 2.56-second segments for the 96→192 kHz upsampling stage. ... G.2 Model architecture: We adopt the Diffusion Transformer (DiT) architecture as the noise predictor for our bridge model, following the design of Stable Audio Open [28]. ... Any-to-48 kHz Stage: Depth: 24 layers; Attention heads: 24; Hidden dimension: 1152; Scaling factor: 0.25; Training sample length: 245760 samples (5.12 s). Training: Model: Batch size: 128; Optimizer: Adam with β1 = 0.9, β2 = 0.99; Learning rate: 1 × 10⁻⁵; Weight decay: 0; Training sample rate range: 2 kHz to 32 kHz; Bridge g2 max = 1.0; Bridge g2 min = 0.001
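
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which also makes it easy to check the stated crop length (245760 samples for 5.12 s at 48 kHz). The following is a minimal sketch, not the authors' code: the class name `StageConfig`, its field names, and the helper `crop_samples` are hypothetical, and the learning rate is taken to be 1e-5, which is an assumption about the value printed in the row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Values quoted for the any-to-48 kHz stage (names are illustrative)."""

    target_rate_hz: int = 48_000   # target sampling rate of this stage
    crop_seconds: float = 5.12     # random crop length used in training
    depth: int = 24                # number of DiT layers
    attention_heads: int = 24
    hidden_dim: int = 1152
    scaling_factor: float = 0.25
    batch_size: int = 128
    adam_beta1: float = 0.9
    adam_beta2: float = 0.99
    learning_rate: float = 1e-5    # assumed reading of the quoted value
    weight_decay: float = 0.0

    def crop_samples(self) -> int:
        # round() guards against floating-point error in seconds * rate
        return round(self.crop_seconds * self.target_rate_hz)


cfg = StageConfig()
print(cfg.crop_samples())  # 245760, matching the quoted sample length

# Derived here, not stated in the row: per-head width of the attention layers.
print(cfg.hidden_dim // cfg.attention_heads)  # 48
```

Only the 48 kHz crop length is stated in the row. Applying the same arithmetic to the other stages is left to the reader, since their remaining hyperparameters are not quoted here.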