Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Authors: Chaoyou Fu, Haojia Lin, Xiong Wang, Yifan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We have conducted extensive evaluations on various benchmarks related to image, video, and speech understanding, comparing the results with both open-source and proprietary models. VITA-1.5 demonstrates perception and reasoning capabilities comparable to leading image/video based MLLMs, and shows significant improvements in the speech capability.
Researcher Affiliation | Collaboration | 1. State Key Laboratory for Novel Software Technology, Nanjing University; 2. School of Intelligence Science and Technology, Nanjing University; 3. Tencent Youtu Lab; 4. XMU; 5. CASIA
Pseudocode | No | The paper describes the model architecture and training strategies in Sections 3.1, 3.2, and 3.3 using descriptive text and figures, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Source Code: https://github.com/VITA-MLLM/VITA
Open Datasets | Yes | As shown in Table 1, the training data of multimodal instruction tuning encompass a wide range of categories, such as caption data and QA data, both Chinese and English. During different training phases, subsets of the overall dataset are selectively sampled to serve different objectives. Specifically, the datasets are categorized as follows: Image Captioning Data. Datasets such as ShareGPT4V [42], ALLaVA-Caption [43], SharedGPT4o-Image, and synthetic data are used... The images of the synthetic data come from open-source datasets like Wukong [51], LAION [52], and CC12M [53].
Dataset Splits | Yes | Stage 1.1 Vision Alignment. ... We use 20% of the descriptive caption data from Table 1 for training... Stage 1.2 Vision Understanding. ... we use all the descriptive caption data from Table 1... Stage 1.3 Vision SFT. ... we use all the QA data from Table 1 while retaining 20% of the descriptive caption data... Stage 2.2 Audio SFT. ... we sample 4% of the caption data and 20% of the QA data from Table 1.
Hardware Specification | No | The paper discusses the results of evaluations and mentions the high computational cost of training from scratch, but it does not specify the hardware used for training or inference, such as GPU/CPU models or memory amounts.
Software Dependencies | No | The paper refers to various models and frameworks (e.g., LLaMA, InternViT, TiCodec) but does not provide version numbers for the software dependencies or libraries used for implementation (e.g., PyTorch, TensorFlow, Python versions).
Experiment Setup | Yes | In this paper, we introduce VITA-1.5, a multimodal LLM that integrates vision, language, and speech through a carefully designed three-stage training methodology. The training strategy progressively incorporates vision and speech data, relieving modality conflicts while maintaining strong multimodal performance. In the first stage, we focus on vision-language by training visual adapters and finetuning the model with descriptive caption and visual QA data... Visual Encoder. VITA-1.5 adopts InternViT-300M as the visual encoder, with an input image size of 448 × 448 pixels, generating 256 visual tokens per image. ... Speech Encoder. Similar to [40], our audio encoding module consists of multiple downsampling convolutional layers (4x downsampling) and 24 Transformer blocks (with a hidden size of 1024).
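
The per-stage sampling quoted in the Dataset Splits row can be sketched as a small lookup table plus a sampler. This is only an illustration of the quoted fractions, not the authors' data pipeline: the names STAGE_FRACTIONS, sample_stage, and stage_sizes are hypothetical, uniform random sampling is an assumption (the quoted text does not say how subsets are drawn), and stages whose fractions are elided in the quote are left out rather than guessed.

```python
import random

# Fractions of the Table 1 caption and QA data quoted per training stage.
# Only stages that appear in the quoted text are listed; others are omitted.
STAGE_FRACTIONS = {
    "1.1 Vision Alignment": {"caption": 0.20, "qa": 0.00},
    "1.2 Vision Understanding": {"caption": 1.00, "qa": 0.00},
    "1.3 Vision SFT": {"caption": 0.20, "qa": 1.00},
    "2.2 Audio SFT": {"caption": 0.04, "qa": 0.20},
}


def sample_stage(pools, stage, seed=0):
    """Draw the subset for one stage from the caption and QA pools.

    Assumption: uniform sampling without replacement; the source does not
    specify the sampling scheme.
    """
    rng = random.Random(seed)
    subset = []
    for kind, fraction in STAGE_FRACTIONS[stage].items():
        items = pools[kind]
        count = round(len(items) * fraction)
        subset.extend(rng.sample(items, count))
    return subset


def stage_sizes(pools):
    """Return the number of sampled examples for every listed stage."""
    return {stage: len(sample_stage(pools, stage)) for stage in STAGE_FRACTIONS}


if __name__ == "__main__":
    # Toy pools standing in for the real caption and QA data.
    toy_pools = {
        "caption": [f"caption-{i}" for i in range(1000)],
        "qa": [f"qa-{i}" for i in range(500)],
    }
    for stage, size in stage_sizes(toy_pools).items():
        print(stage, size)
```

Keeping the fractions in one table makes it easy to see that caption data is used in full only in Stage 1.2 and is reduced to a small share in the later stages that appear in the quote.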