Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Authors: Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We conducted evaluations for the LLaVA-Video models across all benchmarks using LMMs-Eval (Zhang et al., 2024a) to ensure standardization and reproducibility. For full evaluation, we consider 11 video benchmarks. ...conducted tests across various video captioning, video open-ended question-answering and video multiple-choice question-answering benchmarks.
Researcher Affiliation Collaboration Yuanhan Zhang EMAIL S-Lab, Nanyang Technological University Jinming Wu EMAIL BUPT Wei Li EMAIL ByteDance Bo Li EMAIL S-Lab, Nanyang Technological University Zejun Ma EMAIL ByteDance Ziwei Liu EMAIL S-Lab, Nanyang Technological University Chunyuan Li EMAIL ByteDance
Pseudocode No The paper describes methods and pipelines in text and with diagrams (e.g., Figure 2 for the video detail description creation pipeline), but does not present any formal pseudocode or algorithm blocks.
Open Source Code Yes Open-Source: In an effort to support the development of general-purpose visual assistants, we release our multimodal instruction data, codebase, model checkpoints, and a visual chat demo to the public.
Open Datasets Yes Video-language Instruction-Following Data: We present a high-quality dataset LLaVA-Video-178K tailored for video instruction-following. It consists of 178K videos with 1.3M instruction samples, including detailed captions, free-form and multiple-choice question answering. Open-Source: In an effort to support the development of general-purpose visual assistants, we release our multimodal instruction data, codebase, model checkpoints, and a visual chat demo to the public. We fine-tune LLaVA-OneVision (SI) on the joint dataset of video and image data. Specifically, we added video data from the LLaVA-Video-178K dataset and four public datasets: ActivityNet-QA (Yu et al., 2019), NExT-QA (Xiao et al., 2021), Perception Test (Pătrăucean et al., 2023), and LLaVA-Hound-255K (Zhang et al., 2024d).
Dataset Splits Yes For ablation studies in Sec. 4.2 and Sec. 4.3, we conduct evaluation across 4 datasets. NExT-QA (Xiao et al., 2021) and Perception Test (Pătrăucean et al., 2023), which use training data from the LLaVA-Video-178K, are treated as in-domain datasets. Conversely, VideoMME (Fu et al., 2024) and EgoSchema (Mangalam et al., 2024) are considered as zero-shot datasets. We fine-tune LLaVA-OneVision (SI) on the joint dataset of video and image data. Specifically, we added video data from the LLaVA-Video-178K dataset and four public datasets: ActivityNet-QA (Yu et al., 2019), NExT-QA (Xiao et al., 2021), Perception Test (Pătrăucean et al., 2023), and LLaVA-Hound-255K (Zhang et al., 2024d), focusing on videos shorter than three minutes.
Hardware Specification Yes On 128 NVIDIA H100 GPUs, the video representations for LLaVA-Video-7B and LLaVA-Video-72B are V = (64, 679, 1, 2) and V = (64, 679, 3, 2), respectively. ...with the Qwen2-72B model, we could only process 8 frames before maxing out the memory on 128 NVIDIA H100 GPUs.
Software Dependencies No The paper mentions several models and tools like GPT-4o, PySceneDetect, SigLIP, Qwen2, and sentence-transformer with citations, but does not specify software versions for programming languages, libraries, or operating systems used for the experiments.
Experiment Setup Yes We fine-tune LLaVA-OneVision (SI) on the joint dataset of video and image data. Specifically, we added video data from the LLaVA-Video-178K dataset and four public datasets: ActivityNet-QA (Yu et al., 2019), NExT-QA (Xiao et al., 2021), Perception Test (Pătrăucean et al., 2023), and LLaVA-Hound-255K (Zhang et al., 2024d), focusing on videos shorter than three minutes. These datasets were selected to improve our model’s performance, contributing to a total of 1.6 million video-language samples, which include 193,510 video descriptions, 1,241,412 open-ended questions, and 215,625 multiple-choice questions. Additionally, we used 1.1 million image-language pairs from the LLaVA-OneVision model (Li et al., 2024c). We consider the same video representation configurations for the training and inference stages. On 128 NVIDIA H100 GPUs, the video representations for LLaVA-Video-7B and LLaVA-Video-72B are V = (64, 679, 1, 2) and V = (64, 679, 3, 2), respectively.
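
As a quick consistency check on the sample counts quoted in the Experiment Setup row, the short Python sketch below sums the three video-language categories and compares the result with the stated total of roughly 1.6 million. The variable names are illustrative and do not come from the paper or its codebase, and the 1.1 million image-language figure is the rounded value given in the quote, so the combined total is only approximate.

```python
# Sample counts as quoted in the Experiment Setup row above.
# The dictionary keys are illustrative labels, not identifiers from the paper.
video_samples = {
    "video_descriptions": 193_510,
    "open_ended_questions": 1_241_412,
    "multiple_choice_questions": 215_625,
}

# "1.1 million image-language pairs" is a rounded figure in the source,
# so anything derived from it is approximate.
image_language_pairs_approx = 1_100_000

# Exact sum of the three video-language categories.
total_video = sum(video_samples.values())

# Share of each category within the video-language samples.
shares = {name: count / total_video for name, count in video_samples.items()}

# Approximate size of the joint video + image training mixture.
total_mixture_approx = total_video + image_language_pairs_approx

print(f"video-language samples: {total_video:,}")
for name, share in shares.items():
    print(f"  {name}: {share:.1%}")
print(f"approximate joint mixture: {total_mixture_approx:,}")
```

The three categories sum to 1,650,547, which is consistent with the quoted "1.6 million video-language samples", and open-ended questions make up roughly three quarters of that total.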