Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation
Authors: Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least 30% relative performance improvement over existing open-source full-duplex models and performs highly competitively to half-duplex and turn-based systems, despite using substantially less training data. |
| Researcher Affiliation | Collaboration | ¹Tsinghua University, ²ByteDance, ³University of Cambridge |
| Pseudocode | No | The paper describes the methodology in narrative text and figures, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Some demo conversations between user and SALMONN-omni are provided in the following repository https://github.com/bytedance/SALMONN. NeurIPS Paper Checklist, Question 5: Does the paper provide open access to the data and code...? Answer: [No] Justification: We'll open source the code once accepted. |
| Open Datasets | Yes | For ASR task, LibriSpeech-960h [48] corpus and GigaSpeech-M [49] with 480k training samples are utilized. For spoken QA task, we gather questions from a variety of sources, including Alpaca-52k [50], WebQuestions [51], TriviaQA [52], SQuAD [53], Natural Questions [54], VoiceAssistant-400K from Mini-Omni [14] and UltraChat from SLAM-Omni [55]. |
| Dataset Splits | No | The paper lists various datasets used for training and evaluation along with the number of samples for some tasks (e.g., 'GigaSpeech-M with 480k training samples', '730k QA samples', '80k multi-round conversation samples'). However, it does not provide specific training, validation, and test splits (e.g., percentages, exact counts, or references to predefined splits) for these datasets. |
| Hardware Specification | Yes | All training processes are performed on 32 A100 GPUs with 50k and 30k steps for stage 1 and 2. |
| Software Dependencies | Yes | Our streaming speech synthesizer is finetuned based on CosyVoice2-0.5B [46]. |
| Experiment Setup | Yes | We use Llama-3-8B-Instruct as the LLM backbone and use LoRA [47] with a rank of 32 and a scaling factor of 1.0 when finetuning the LLM backbone. Our streaming speech synthesizer is finetuned based on CosyVoice2-0.5B [46]. We set 80 milliseconds (ms) as the time block size and the model generates one textual token after listening to 80 ms of audio. In the first two stages, the batch size is set to 128 and the learning rates are $4 \times 10^{-5}$ and $3 \times 10^{-5}$ respectively. In the third stage, batch sizes include 128, 256 and 512 are compared with learning rate set to $1 \times 10^{-6}$. All training processes are performed on 32 A100 GPUs with 50k and 30k steps for stage 1 and 2. |
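
The figures quoted in the Experiment Setup row imply a few derived quantities, sketched below in Python. This is an illustrative summary, not the authors' configuration: the names `StageConfig`, `STAGES`, `text_tokens_per_second`, and `samples_seen` are hypothetical. It assumes the stage 1 and stage 2 learning rates read as 4e-5 and 3e-5, that the batch size of 128 is the global batch size, and that each reported step is one optimizer update. Stage 3 is left out because the quoted text gives no step count for it.

```python
from dataclasses import dataclass

# Values quoted in the Experiment Setup row; all names here are hypothetical.
BLOCK_MS = 80        # time block size in milliseconds
LORA_RANK = 32       # LoRA rank used on the LLM backbone
LORA_SCALING = 1.0   # LoRA scaling factor as quoted


@dataclass(frozen=True)
class StageConfig:
    name: str
    batch_size: int       # assumed to be the global batch size
    learning_rate: float  # assumed reading of the quoted learning rates
    steps: int            # assumed to count optimizer updates


STAGES = [
    StageConfig("stage1", batch_size=128, learning_rate=4e-5, steps=50_000),
    StageConfig("stage2", batch_size=128, learning_rate=3e-5, steps=30_000),
]


def text_tokens_per_second(block_ms: int = BLOCK_MS) -> float:
    """One textual token per block, so the rate is blocks per second."""
    return 1000 / block_ms


def samples_seen(stage: StageConfig) -> int:
    """Training samples processed in a stage under the assumptions above."""
    return stage.batch_size * stage.steps


if __name__ == "__main__":
    print(f"text tokens per second of audio: {text_tokens_per_second()}")
    for stage in STAGES:
        print(f"{stage.name}: {samples_seen(stage):,} samples")
```

Under these assumptions, the 80 ms block size corresponds to 12.5 textual tokens per second of audio, and stages 1 and 2 process 6.4 million and 3.84 million samples respectively. These are back-of-the-envelope figures derived from the quoted numbers, not values reported in the paper.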