Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Unleashing Hour-Scale Video Training for Long Video-Language Understanding

Authors: Jingyang Lin, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Xiaodong Yu, Hao Chen, Jiebo Luo, Zicheng Liu, Emad Barsoum

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, Hour-LLaVA achieves the best performance on multiple representative long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.
Researcher Affiliation | Collaboration | Jingyang Lin1,2 Jialian Wu1 Ximeng Sun1 Ze Wang1 Jiang Liu1 Yusheng Su1 Xiaodong Yu1 Hao Chen1 Jiebo Luo2 Zicheng Liu1 Emad Barsoum1 1 AMD 2 University of Rochester https://videomarathon.github.io/ [...] Work was done during the internship at AMD. Corresponding author: EMAIL.
Pseudocode | No | The paper includes figures describing prompts for LLMs (Figures 5-10) but does not contain structured pseudocode or algorithm blocks describing the Hour-LLaVA model's methodology itself.
Open Source Code | Yes | URLs to data and code are attached in the supplementary materials.
Open Datasets | Yes | To this end, we introduce VideoMarathon, a large-scale video instruction-following dataset specifically designed for long-form video-language modeling. [...] VideoMarathon integrates five representative public video datasets: Panda-70M [8], Ego4D [17], ActivityNet [4], YouCook2 [72], and MovieChat-1K [47]. [...] URLs to data and code are attached in the supplementary materials.
Dataset Splits | Yes | We evaluate our models on four mainstream video-language benchmarks: TempCompass [36], LongVideoBench [58], Video-MME [14], and LVBench [55].
Hardware Specification | Yes | We train Hour-LLaVA-3B with 64 AMD MI250 GPUs and Hour-LLaVA-7B with 64 AMD MI300X GPUs, respectively.
Software Dependencies | No | The paper mentions optimizers (AdamW) and LLM decoders (Qwen2.5-3B-Instruct, Qwen2-7B-Instruct) but does not provide specific version numbers for ancillary software dependencies like programming languages or libraries (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | For video-language training, we set the global batch sizes to 128 and 256 for the 3B and 7B models, respectively. A learning rate of 2e-5 is used with a 0.03 warmup ratio under a cosine annealing schedule. The models are optimized using the AdamW [37] optimizer with a cross-entropy loss. [...] Table 8: Detailed training schedule for each training stage of Hour-LLaVA, including compression details, data usage, and training hyperparameters.
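
The learning-rate schedule quoted under Experiment Setup (peak 2e-5, 0.03 warmup ratio, cosine annealing) can be illustrated with a minimal sketch. The quoted text gives only the peak rate, the warmup ratio, and the schedule family; the linear warmup shape, the decay to zero, the total step count, and the function name `lr_at_step` are assumptions made here for illustration and are not taken from the paper.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_ratio=0.03):
    """Learning rate at a 0-indexed optimizer step.

    peak_lr and warmup_ratio match the quoted setup; everything else
    (linear warmup, annealing to zero) is an assumption of this sketch.
    """
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Assumed: linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Assumed: cosine annealing from peak_lr down to zero over the remainder.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = (step - warmup_steps) / decay_steps
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 10_000  # hypothetical step count, not reported in the quoted text
    for s in (0, 299, 300, 5150, 9999):
        print(s, f"{lr_at_step(s, total):.3e}")
```

With a hypothetical 10,000 total steps, the 0.03 ratio corresponds to 300 warmup steps, after which the rate follows a half cosine. The actual per-stage schedule is the one given in the paper's Table 8.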