Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
Authors: Guangzhi Sun, Yudong Yang, Jimin Zhuang, Changli Tang, Yixuan Li, Wei Li, Zejun Ma, Chao Zhang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | video-SALMONN-o1 achieves 3-8% accuracy improvements over the LLaVA-OneVision baseline across different video reasoning benchmarks. In addition, pDPO achieves 6-8% improvements compared to the supervised fine-tuning model on RivaBench. |
| Researcher Affiliation | Collaboration | ¹Tsinghua University, ²ByteDance, ³University of Cambridge. Correspondence to: Chao Zhang <EMAIL>. |
| Pseudocode | No | The paper describes the proposed method, pDPO, using mathematical equations and textual explanations, but it does not present any structured pseudocode blocks or algorithms. |
| Open Source Code | Yes | Code available: https://github.com/BriansIDP/video-SALMONN-o1 |
| Open Datasets | Yes | Following Tang et al. (2024a), the audio modality alignment stage employs LibriSpeech-960h (Panayotov et al., 2015) ASR data and AudioCaps (Kim et al., 2019) audio caption data to train the audio aligner. |
| Dataset Splits | No | The paper mentions collecting "150k normal question-answer (QA) pairs" and "30k reasoning-intensive SFT QA pairs" for training and describes a "held-out validation set" and "Riva Bench" for evaluation. However, it does not provide specific percentages or counts for training/validation/test splits of its SFT data, nor does it reference standard splits for this internally generated data, making it difficult to reproduce the exact data partitioning. |
| Hardware Specification | Yes | SFT is performed on 16 A100 GPUs for 48 hours and pDPO is trained with 8 A100 GPUs for 24 hours. |
| Software Dependencies | No | The paper mentions using "SigLIP (Zhai et al., 2023) visual encoder and Qwen2 with 7B parameters backbone LLM" and "Whisper-Large-v3 encoder (Radford et al., 2023)". While these are specific models/frameworks, no version numbers are provided for the underlying software libraries (e.g., PyTorch, TensorFlow, CUDA) or general programming language versions used for implementation. |
| Experiment Setup | Yes | We set LoRA hyper-parameters r = 64 and α = 256 for the backbone LLM for both SFT and pDPO. During training, the visual encoder and aligner, audio encoder, and LLM remain frozen. SFT is performed on 16 A100 GPUs for 48 hours and pDPO is trained with 8 A100 GPUs for 24 hours. The model processes videos at a 2-frame-per-second rate with a maximum of 60 frames. ... producing 150 audio tokens for every 30 seconds. Greedy decoding is used during inference. |
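The Experiment Setup row quotes LoRA hyper-parameters r = 64 and α = 256. As a minimal sketch of what those numbers mean mechanically (not the paper's actual implementation, which lives in the linked repository), a LoRA-adapted linear layer adds a low-rank update scaled by α / r on top of a frozen base weight; dimensions and initialization below are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of a LoRA-adapted linear layer using the reported
# hyper-parameters r = 64 and alpha = 256 (scaling factor alpha / r = 4).
# d_in / d_out are illustrative; the paper's backbone is a 7B Qwen2 LLM.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 512, 512, 64, 256

W = rng.standard_normal((d_out, d_in)) * 0.01   # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01       # trainable low-rank factor
B = np.zeros((d_out, r))                        # zero-initialized, so the
                                                # adapter starts as a no-op

def lora_forward(x):
    # y = W x + (alpha / r) * B A x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B = 0, the adapted layer matches the frozen base layer exactly.
assert np.allclose(lora_forward(x), W @ x)
```

During fine-tuning only A and B would receive gradients, which is consistent with the quoted setup where the visual encoder, audio encoder, and backbone LLM weights remain frozen.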