Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation

Authors: Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least 30% relative performance improvement over existing open-source full-duplex models and performs highly competitively to half-duplex and turn-based systems, despite using substantially less training data.
Researcher Affiliation | Collaboration | 1 Tsinghua University; 2 ByteDance; 3 University of Cambridge
Pseudocode | No | The paper describes the methodology in narrative text and figures, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | Some demo conversations between user and SALMONN-omni are provided in the following repository https://github.com/bytedance/SALMONN. NeurIPS Paper Checklist, Question 5: Does the paper provide open access to the data and code...? Answer: [No] Justification: We'll open source the code once accepted.
Open Datasets | Yes | For ASR task, LibriSpeech-960h [48] corpus and GigaSpeech-M [49] with 480k training samples are utilized. For spoken QA task, we gather questions from a variety of sources, including Alpaca-52k [50], WebQuestions [51], TriviaQA [52], SQuAD [53], Natural Questions [54], VoiceAssistant-400K from Mini-Omni [14] and UltraChat from SLAM-Omni [55].
Dataset Splits | No | The paper lists various datasets used for training and evaluation along with the number of samples for some tasks (e.g., 'GigaSpeech-M with 480k training samples', '730k QA samples', '80k multi-round conversation samples'). However, it does not provide specific training, validation, and test splits (e.g., percentages, exact counts, or references to predefined splits) for these datasets.
Hardware Specification | Yes | All training processes are performed on 32 A100 GPUs with 50k and 30k steps for stage 1 and 2.
Software Dependencies | Yes | Our streaming speech synthesizer is finetuned based on CosyVoice2-0.5B [46].
Experiment Setup | Yes | We use Llama-3-8B-Instruct as the LLM backbone and use LoRA [47] with a rank of 32 and a scaling factor of 1.0 when finetuning the LLM backbone. Our streaming speech synthesizer is finetuned based on CosyVoice2-0.5B [46]. We set 80 milliseconds (ms) as the time block size and the model generates one textual token after listening to 80 ms of audio. In the first two stages, the batch size is set to 128 and the learning rates are 4 × 10⁻⁵ and 3 × 10⁻⁵ respectively. In the third stage, batch sizes include 128, 256 and 512 are compared with learning rate set to 1 × 10⁻⁶. All training processes are performed on 32 A100 GPUs with 50k and 30k steps for stage 1 and 2.
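
The arithmetic implied by the Experiment Setup row can be sketched in a few lines of Python. This is a reader's illustration, not code from the paper: the names `StageConfig`, `text_tokens_per_second`, and `samples_seen` are hypothetical, the learning rates are taken as 4e-5, 3e-5, and 1e-6, and the sample counts assume that each training step consumes exactly one batch of the stated size.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage, as listed in the summary table."""
    name: str
    batch_size: int
    learning_rate: float
    steps: Optional[int]  # None where the summary reports no step count


# Values copied from the Experiment Setup and Hardware Specification rows.
LORA_RANK = 32
LORA_SCALING = 1.0
BLOCK_MS = 80   # time block size in milliseconds
NUM_GPUS = 32   # A100 GPUs

STAGES = [
    StageConfig("stage1", batch_size=128, learning_rate=4e-5, steps=50_000),
    StageConfig("stage2", batch_size=128, learning_rate=3e-5, steps=30_000),
    # Stage 3 compares batch sizes 128, 256 and 512; only one is shown here,
    # and no step count is given in the summary.
    StageConfig("stage3", batch_size=128, learning_rate=1e-6, steps=None),
]


def text_tokens_per_second(block_ms: int) -> float:
    """One textual token per time block, so the rate is 1000 ms / block size."""
    return 1000 / block_ms


def samples_seen(stage: StageConfig) -> Optional[int]:
    """Samples processed in a stage, ASSUMING one batch per step."""
    if stage.steps is None:
        return None
    return stage.batch_size * stage.steps


if __name__ == "__main__":
    print(f"text tokens per second of audio: {text_tokens_per_second(BLOCK_MS)}")
    for stage in STAGES:
        print(stage.name, stage.learning_rate, samples_seen(stage))
```

Under these assumptions, an 80 ms block corresponds to 12.5 textual tokens per second of audio, and stages 1 and 2 process 6.4 million and 3.84 million samples respectively.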