Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model

Authors: Guangzhi Sun, Yudong Yang, Jimin Zhuang, Changli Tang, Yixuan Li, Wei Li, Zejun Ma, Chao Zhang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental video-SALMONN-o1 achieves 3-8% accuracy improvements over the LLaVA-OneVision baseline across different video reasoning benchmarks. In addition, pDPO achieves 6-8% improvements over the supervised fine-tuning model on RivaBench.
Researcher Affiliation Collaboration 1Tsinghua University, 2ByteDance, 3University of Cambridge. Correspondence to: Chao Zhang <EMAIL>.
Pseudocode No The paper describes the proposed method, pDPO, using mathematical equations and textual explanations, but it does not present any structured pseudocode blocks or algorithms.
Open Source Code Yes Code available: https://github.com/BriansIDP/video-SALMONN-o1.
Open Datasets Yes Following Tang et al. (2024a), the audio modality alignment stage employs LibriSpeech-960h (Panayotov et al., 2015) ASR data and AudioCaps (Kim et al., 2019) audio caption data to train the audio aligner.
Dataset Splits No The paper mentions collecting "150k normal question-answer (QA) pairs" and "30k reasoning-intensive SFT QA pairs" for training, and describes a "held-out validation set" and "RivaBench" for evaluation. However, it provides no percentages or counts for training/validation/test splits of its SFT data, nor does it reference standard splits for this internally generated data, making the exact data partitioning difficult to reproduce.
Hardware Specification Yes SFT is performed on 16 A100 GPUs for 48 hours and pDPO is trained with 8 A100 GPUs for 24 hours.
Software Dependencies No The paper mentions using the "SigLIP (Zhai et al., 2023) visual encoder and Qwen2 with 7B parameters backbone LLM" and the "Whisper-Large-v3 encoder (Radford et al., 2023)". While these are specific models, no version numbers are provided for the underlying software libraries (e.g., PyTorch, CUDA) or the programming language versions used for implementation.
Experiment Setup Yes We set LoRA hyper-parameters r = 64 and α = 256 for the backbone LLM for both SFT and pDPO. During training, the visual encoder and aligner, audio encoder, and LLM remain frozen. SFT is performed on 16 A100 GPUs for 48 hours and pDPO is trained with 8 A100 GPUs for 24 hours. The model processes videos at a 2-frame-per-second rate with a maximum of 60 frames. ... producing 150 audio tokens for every 30 seconds. Greedy decoding is used during inference.
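The input budget stated in the setup (2 fps video capped at 60 frames; 150 audio tokens per 30 seconds) can be sketched as small helpers. This is an illustrative sketch only: the paper does not say whether partial seconds of video are floored or whether audio shorter than a 30-second chunk is padded up, so both rounding choices below are assumptions.

```python
import math

def video_frames(duration_s: float, fps: float = 2.0, max_frames: int = 60) -> int:
    """Frames sampled at 2 fps, capped at 60 frames (flooring assumed)."""
    return min(int(duration_s * fps), max_frames)

def audio_tokens(duration_s: float, tokens_per_30s: int = 150) -> int:
    """150 audio tokens per 30-second chunk (round-up padding assumed)."""
    return math.ceil(duration_s / 30.0) * tokens_per_30s

# A 10 s clip yields 20 frames; anything past 30 s of video hits the 60-frame cap.
print(video_frames(10), video_frames(100))   # 20 60
print(audio_tokens(30), audio_tokens(45))    # 150 300
```

Under these assumptions the frame budget saturates at 30 seconds of video, while the audio token count keeps growing with clip length.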