Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Unleashing Hour-Scale Video Training for Long Video-Language Understanding

Authors: Jingyang Lin, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Xiaodong Yu, Hao Chen, Jiebo Luo, Zicheng Liu, Emad Barsoum

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In our experiments, Hour-LLaVA achieves the best performance on multiple representative long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.
Researcher Affiliation	Collaboration	Jingyang Lin1,2 Jialian Wu1 Ximeng Sun1 Ze Wang1 Jiang Liu1 Yusheng Su1 Xiaodong Yu1 Hao Chen1 Jiebo Luo2 Zicheng Liu1 Emad Barsoum1 1 AMD 2 University of Rochester https://videomarathon.github.io/ [...] Work was done during the internship at AMD. Corresponding author: EMAIL.
Pseudocode	No	The paper includes figures describing prompts for LLMs (Figures 5-10) but does not contain structured pseudocode or algorithm blocks describing the Hour-LLaVA model's methodology itself.
Open Source Code	Yes	URLs to data and code are attached in the supplementary materials.
Open Datasets	Yes	To this end, we introduce VideoMarathon, a large-scale video instruction-following dataset specifically designed for long-form video-language modeling. [...] VideoMarathon integrates five representative public video datasets: Panda-70M [8], Ego4D [17], ActivityNet [4], YouCook2 [72], and MovieChat-1K [47]. [...] URLs to data and code are attached in the supplementary materials.
Dataset Splits	Yes	We evaluate our models on four mainstream video-language benchmarks: TempCompass [36], LongVideoBench [58], Video-MME [14], and LVBench [55].
Hardware Specification	Yes	We train Hour-LLaVA-3B with 64 AMD MI250 GPUs and Hour-LLaVA-7B with 64 AMD MI300X GPUs, respectively.
Software Dependencies	No	The paper mentions optimizers (AdamW) and LLM decoders (Qwen2.5-3B-Instruct, Qwen2-7B-Instruct) but does not provide specific version numbers for ancillary software dependencies like programming languages or libraries (e.g., Python, PyTorch, CUDA).
Experiment Setup	Yes	For video-language training, we set the global batch sizes to 128 and 256 for the 3B and 7B models, respectively. A learning rate of 2e-5 is used with a 0.03 warmup ratio under a cosine annealing schedule. The models are optimized using the AdamW [37] optimizer with a cross-entropy loss. [...] Table 8: Detailed training schedule for each training stage of Hour-LLaVA, including compression details, data usage, and training hyperparameters.
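
The learning-rate settings quoted in the Experiment Setup row (peak rate 2e-5, 0.03 warmup ratio, cosine annealing) can be illustrated with a minimal sketch. This is not the authors' code: the function name `lr_at_step`, the linear shape of the warmup, the decay down to zero, and the example step count are all assumptions made here for illustration, since the quoted text does not specify them.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.03):
    """Learning rate at a given optimizer step (hypothetical helper).

    Assumes linear warmup over the first `warmup_ratio` fraction of steps,
    followed by cosine annealing from `base_lr` down to zero.
    """
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # illustrative step count, not taken from the paper
    for s in (0, 29, 30, 500, 1000):
        print(s, lr_at_step(s, total))
```

With 1000 total steps, warmup covers the first 30 steps, the rate peaks at 2e-5, and it then decays along a half cosine. In practice the number of steps follows from the dataset size and the global batch sizes (128 and 256) listed in the same row.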