Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
VITA-Audio: Fast Interleaved Audio-Text Token Generation for Efficient Large Speech-Language Model
Authors: Zuwei Long, Yunhang Shen, Chaoyou Fu, Heting Gao, Lijiang Li, Peixian Chen, Mengdan Zhang, Hang Shao, Jian Li, Jinlong Peng, Haoyu Cao, Ke Li, Rongrong Ji, Xing Sun
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiment 4.1 Experiment Settings We use the Qwen2.5-7B-Instruct [45] as the pre-trained text LLM. The initial version of VITA-Audio utilizes the speech tokenizer and speech decoder in GLM-4-Voice [66], which effectively captures semantic information at an ultra-low bitrate. In the second version, i.e., VITA-Audio-Plus further replaces the GLM-4-Voice tokenizer with SenseVoiceSmall [1] and an MLP-based adapter. The detail comparison between VITA-Audio and VITA-Audio-Plus is listed in Table E2. 4.2 Evaluation on Spoken Question Answering We evaluate the spoken question answering capability of VITA-Audio on three public English datasets: Web-Questions [5], Llama-Question [42], and TriviaQA [33]. Two evaluation methods are employed: S → T, where the text responses generated by the model are evaluated directly, and S → S, where the model's speech responses are transcribed using Whisper [46] before evaluation. Table 2: Results on Spoken Question Answering (SQA) benchmarks. Sx denotes the x-th training stage of speech models. ... 4.3 Evaluation on Fundamental Speech Competence TTS We evaluate the TTS performance of VITA-Audio on Seed-TTS [2] and LibriTTS [65] benchmarks. ... ASR We evaluate the ASR performance of the four stages of VITA-Audio on WenetSpeech [67], AIshell [6], LibriSpeech [43], and Fleurs [15], and a subset of the results are reported in Table 4. ... 4.4 Evaluation of Latency Inference Speedup Efficient mapping between text and speech is the core of VITA-Audio. To demonstrate its effectiveness, we compare the inference time across different modes of VITA-Audio for various model sizes. Specifically, we evaluate the inference speed on GPUs capable of 148 TFLOPS under bfloat16 precision, with the output fixed at 4096 tokens, and record the total time as the model's inference time. |
| Researcher Affiliation | Collaboration | Zuwei Long1, Yunhang Shen1, Chaoyou Fu2, Heting Gao1, Lijiang Li2, Peixian Chen1, Mengdan Zhang1, Hang Shao1, Jian Li1, Jinlong Peng1, Haoyu Cao1, Ke Li1, Rongrong Ji3, Xing Sun1. 1Tencent Youtu Lab, 2Nanjing University, 3Xiamen University |
| Pseudocode | No | The paper describes the methodology in prose and uses diagrams (e.g., Figure 2 for architecture overview) to illustrate components and data flow. No explicit pseudocode or algorithm blocks are present. |
| Open Source Code | Yes | https://github.com/VITA-MLLM/VITA-Audio and "We fully release VITA-Audio to the open-source community." |
| Open Datasets | Yes | VITA-Audio is trained exclusively on open-source datasets, integrating multi-domain and multi-language speech data resources. The training dataset encompasses a diverse range of sources. Detailed descriptions of the datasets used at each stage are provided in Table E1 in the Appendix. ASR Data We aggregated approximately 100,000 hours of open-source Automatic Speech Recognition (ASR) data, including WenetSpeech [67], Librispeech [43], Multilingual LibriSpeech [44], Common Voice 17 [43], MMCRSC [40], GigaSpeech [8], People's Speech [27], VoxPopuli [53], and the AISHELL series (AISHELL-1 [6] to AISHELL-4 [26]). |
| Dataset Splits | Yes | We evaluate the spoken question answering capability of VITA-Audio on three public English datasets: Web-Questions [5], Llama-Question [42], and TriviaQA [33]... We evaluate the TTS performance of VITA-Audio on Seed-TTS [2] and LibriTTS [65] benchmarks... We evaluate the ASR performance of the four stages of VITA-Audio on WenetSpeech [67], AIshell [6], LibriSpeech [43], and Fleurs [15] |
| Hardware Specification | No | Specifically, we evaluate the inference speed on GPUs capable of 148 TFLOPS under bfloat16 precision, with the output fixed at 4096 tokens, and record the total time as the model's inference time. |
| Software Dependencies | No | We use Transformers [57] and FlashAttention-2 [17]. |
| Experiment Setup | Yes | We use the Qwen2.5-7B-Instruct [45] as the pre-trained text LLM. All training data are uniformly packed into sequences of fixed length (8K tokens), an approach that enables effective training on samples of varying lengths [47]. We initialize the MCTP module using parameters from the final layer of the LLM, with the gradient detached from the LLM. To optimize training effectiveness, different learning rates are used for the MCTP module and the main LLM. |
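
The Experiment Setup excerpt mentions that training data are packed into fixed-length sequences of 8K tokens. The paper excerpt does not say how the packing is done, so the following is only a minimal sketch of one common approach (greedy concatenation followed by fixed-size chunking). The function name `pack_sequences`, the `pad_id` value, and the choice to let a sample span two packed sequences are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Iterable, List


def pack_sequences(
    samples: Iterable[List[int]],
    seq_len: int = 8192,  # "8K tokens" from the excerpt; exact value assumed
    pad_id: int = 0,      # hypothetical padding token id
) -> List[List[int]]:
    """Concatenate tokenized samples and cut them into fixed-length sequences.

    Samples may be split across two packed sequences; the final partial
    sequence is padded with `pad_id` up to `seq_len`.
    """
    packed: List[List[int]] = []
    buffer: List[int] = []
    for tokens in samples:
        buffer.extend(tokens)
        # Emit full sequences as soon as enough tokens have accumulated.
        while len(buffer) >= seq_len:
            packed.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    if buffer:
        # Pad the trailing remainder so every sequence has the same length.
        packed.append(buffer + [pad_id] * (seq_len - len(buffer)))
    return packed


if __name__ == "__main__":
    toy = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print(pack_sequences(toy, seq_len=4))
```

A practical pipeline would typically also record sample boundaries so that attention can be masked between unrelated samples; that bookkeeping is omitted here for brevity.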