Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model

Authors: Guangzhi Sun, Yudong Yang, Jimin Zhuang, Changli Tang, Yixuan Li, Wei Li, Zejun Ma, Chao Zhang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental video-SALMONN-o1 achieves 3-8% accuracy improvements over the LLaVA-OneVision baseline across different video reasoning benchmarks. In addition, pDPO achieves 6-8% improvements over the supervised fine-tuning model on RivaBench.
Researcher Affiliation Collaboration 1Tsinghua University, 2ByteDance, 3University of Cambridge. Correspondence to: Chao Zhang <EMAIL>.
Pseudocode No The paper describes the proposed method, pDPO, using mathematical equations and textual explanations, but it does not present any structured pseudocode blocks or algorithms.
Open Source Code Yes Code available: https://github.com/BriansIDP/video-SALMONN-o1.
Open Datasets Yes Following Tang et al. (2024a), the audio modality alignment stage employs LibriSpeech-960h (Panayotov et al., 2015) ASR data and AudioCaps (Kim et al., 2019) audio caption data to train the audio aligner.
Dataset Splits No The paper mentions collecting "150k normal question-answer (QA) pairs" and "30k reasoning-intensive SFT QA pairs" for training, and describes a "held-out validation set" and "RivaBench" for evaluation. However, it provides no percentages or counts for training/validation/test splits of its SFT data, nor does it reference standard splits for this internally generated data, making the exact data partitioning difficult to reproduce.
Hardware Specification Yes SFT is performed on 16 A100 GPUs for 48 hours and pDPO is trained with 8 A100 GPUs for 24 hours.
Software Dependencies No The paper mentions using the "SigLIP (Zhai et al., 2023) visual encoder and Qwen2 with 7B parameters backbone LLM" and the "Whisper-Large-v3 encoder (Radford et al., 2023)". While these are specific models, no version numbers are provided for the underlying software libraries (e.g., PyTorch, CUDA) or the programming language versions used for implementation.
Experiment Setup Yes We set LoRA hyper-parameters r = 64 and α = 256 for the backbone LLM for both SFT and pDPO. During training, the visual encoder and aligner, audio encoder, and LLM remain frozen. SFT is performed on 16 A100 GPUs for 48 hours and pDPO is trained with 8 A100 GPUs for 24 hours. The model processes videos at a 2-frame-per-second rate with a maximum of 60 frames. ... producing 150 audio tokens for every 30 seconds. Greedy decoding is used during inference.
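The input budget stated in the setup (2 fps video capped at 60 frames; 150 audio tokens per 30 seconds) can be sketched as small helpers. This is an illustrative sketch only: the paper does not say whether partial seconds of video are floored or whether audio shorter than a 30-second chunk is padded up, so both rounding choices below are assumptions.

```python
import math

def video_frames(duration_s: float, fps: float = 2.0, max_frames: int = 60) -> int:
    """Frames sampled at 2 fps, capped at 60 frames (flooring assumed)."""
    return min(int(duration_s * fps), max_frames)

def audio_tokens(duration_s: float, tokens_per_30s: int = 150) -> int:
    """150 audio tokens per 30-second chunk (round-up padding assumed)."""
    return math.ceil(duration_s / 30.0) * tokens_per_30s

# A 10 s clip yields 20 frames; anything past 30 s of video hits the 60-frame cap.
print(video_frames(10), video_frames(100))   # 20 60
print(audio_tokens(30), audio_tokens(45))    # 150 300
```

Under these assumptions the frame budget saturates at 30 seconds of video, while the audio token count keeps growing with clip length.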