Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation

Authors: Kai Liu, Jungang Li, Yuchong Sun, Shengqiong Wu, Jianzhang Gao, Daoan Zhang, Wei Zhang, Sheng Jin, Sicheng Yu, Geng Zhan, Jiayi Ji, Fan Zhou, Liang Zheng, Shuicheng Yan, Hao Fei, Tat-Seng Chua

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section presents the experimental results to evaluate the effectiveness of our proposed JavisGPT. For a fair comparison, we train JavisGPT following the three-stage pipeline described in Sec. 4, with each stage running for one epoch. We use zero-shot evaluation for downstream tasks including audio-video comprehension and generation using their official evaluation protocols. Additional implementation and training details can be found in Sec. B.
Researcher Affiliation | Academia | 1ZJU, 2NUS, 3HKUST(GZ), 4RUC, 5UR, 6HZCU, 7NTU, 8SMU, 9USYD, 10ANU. *Equal contribution. Work done during Kai Liu's visiting period at NUS. Email: EMAIL. Corresponding author. Email: EMAIL
Pseudocode | No | The paper describes its architecture (Figures 2, 3, 4) and training strategies (Section 4), but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | Additionally, we include an anonymous code repository for reproducibility: https://anonymous.4open.science/r/JavisGPT. We provide open access to both the code and data via an anonymous repository at https://anonymous.4open.science/r/JavisGPT.
Open Datasets | Yes | To facilitate instruction tuning, we construct JavisInst-Omni, a large-scale, diverse, and high-quality instruction dataset covering both comprehension and generation. The dataset contains 200K dialogue trajectories involving tightly interleaved text, audio, and video, simulating complex single- and multi-turn interactions. All samples are annotated via GPT-4o and manually verified for quality. We provide open access to both the code and data via an anonymous repository at https://anonymous.4open.science/r/JavisGPT. For audio-video generation evaluation, we follow the protocol of JavisDiT [39], using 1,000 text-to-audio-video samples from JavisBench-mini [39] for a comprehensive assessment.
Dataset Splits | Yes | To facilitate instruction tuning, we construct JavisInst-Omni, a large-scale, diverse, and high-quality instruction dataset covering both comprehension and generation. The dataset contains 200K dialogue trajectories involving tightly interleaved text, audio, and video, simulating complex single- and multi-turn interactions. We use 600K audio-text pairs to enhance the audio understanding ability. ... We use 1.5M audio-video-caption triples from TAVGBench [46] to pretrain the generation component... We use 450K audio-video-text triplets from TAVGBench [46]... The same 450K triplets are reused for text-to-sounding-video generation... We first construct the JavisInst-Und subset, a synchrony-aware audio-video QA dataset ... It consists of 110K samples... We also incorporate 95K audio-video understanding samples from VideoLLaMA2 [14], including training splits from AVQA [89], MusicAVQA [41], and AVSD [3]. To prevent catastrophic forgetting, we additionally include 20K image understanding samples from LLaVA-OneVision [32], 60K video understanding samples from LLaVA-Video-178K [95], 550K audio comprehension samples from Stage I's dataset, and 20K audio-video caption samples from TAVGBench [46]... 150K audio-video generation instances are also included... Data Collection. We use GPT-4o to construct 100 multi-turn dialogue samples based on the four scenarios illustrated in Fig. A3, including Gen2Und, Und2Gen, Proactive, and Rethink (25 samples each).
Hardware Specification | Yes | All models are trained on 8 NVIDIA A100-80GB GPUs.
Software Dependencies | No | The paper mentions using specific models and frameworks like Qwen2.5-VL-7B-Instruct [5], Qwen2.5 [87], BEATs [10], and JavisDiT-v0.1 [39], but it does not specify versions for underlying software like Python, PyTorch, or CUDA.
Experiment Setup | Yes | To effectively adapt the vision-language backbone to JAV tasks, we design a three-stage training pipeline: (1) MM-PreTrain: introducing the audio input branch for comprehension and preliminarily aligning the output embeddings of LLM with the condition space of JavisDiT; (2) AVFineTune: enhancing synchronized audio-video comprehension and generation via SyncFusion and hierarchical JavisQueries, respectively; and (3) MM-InstTune: enables generalizable instruction following and multimodal reasoning via large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. LoRA [27] (r = 128, α = 256) is integrated with the LLM backbone to strengthen the adaptation. We set the learning rate to 1e-3 and train for one epoch. ... The learning rate is set to 1e-4, and training runs for one epoch. Table A1: Detailed settings for progressive audio-video-synchronized training. Warm-up Epochs: 0.03; Batch Size: 256, 64, 64; Weight Decay: 0.0
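
The figures quoted in the Dataset Splits, Hardware Specification, and Experiment Setup rows can be cross-checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper or its repository. It rests on several assumptions: that the three batch sizes in Table A1 correspond to the three training stages, that training uses plain data parallelism without gradient accumulation, and that the sample counts listed after the JavisInst-Und sentence belong to a single training mixture. All class, function, and dictionary names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """LoRA hyperparameters as quoted in the Experiment Setup row."""
    rank: int = 128   # r
    alpha: int = 256  # alpha

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.alpha / self.rank


# Quoted from Table A1. Mapping one value to each stage is an assumption.
BATCH_SIZES = (256, 64, 64)
NUM_GPUS = 8  # "8 NVIDIA A100-80GB GPUs"


def per_gpu_batch(global_batch: int, num_gpus: int = NUM_GPUS) -> float:
    # Assumes data parallelism with no gradient accumulation (not stated in the source).
    return global_batch / num_gpus


# Sample counts in thousands, as quoted in the Dataset Splits row.
# Treating them as one mixture is an assumption; the quoted text is elided.
QUOTED_COUNTS_K = {
    "JavisInst-Und": 110,
    "VideoLLaMA2 audio-video understanding": 95,
    "LLaVA-OneVision image understanding": 20,
    "LLaVA-Video-178K video understanding": 60,
    "Stage I audio comprehension": 550,
    "TAVGBench audio-video captions": 20,
    "audio-video generation": 150,
}

# Four multi-turn dialogue scenarios with 25 samples each.
DIALOGUE_SCENARIOS = {"Gen2Und": 25, "Und2Gen": 25, "Proactive": 25, "Rethink": 25}


def total_quoted_samples_k() -> int:
    return sum(QUOTED_COUNTS_K.values())


if __name__ == "__main__":
    print("LoRA scaling (alpha / r):", LoraSettings().scaling)
    print("Per-GPU batch sizes:", [per_gpu_batch(b) for b in BATCH_SIZES])
    print("Quoted samples (K):", total_quoted_samples_k())
    print("Dialogue samples:", sum(DIALOGUE_SCENARIOS.values()))
```

Under these assumptions, the four dialogue scenarios sum to the 100 samples the row reports, and the LoRA update is scaled by a factor of 2.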