Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Unleashing Hour-Scale Video Training for Long Video-Language Understanding
Authors: Jingyang Lin, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Xiaodong Yu, Hao Chen, Jiebo Luo, Zicheng Liu, Emad Barsoum
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, Hour-LLaVA achieves the best performance on multiple representative long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model. |
| Researcher Affiliation | Collaboration | Jingyang Lin1,2 Jialian Wu1 Ximeng Sun1 Ze Wang1 Jiang Liu1 Yusheng Su1 Xiaodong Yu1 Hao Chen1 Jiebo Luo2 Zicheng Liu1 Emad Barsoum1 1 AMD 2 University of Rochester https://videomarathon.github.io/ [...] Work was done during the internship at AMD. Corresponding author: EMAIL. |
| Pseudocode | No | The paper includes figures describing prompts for LLMs (Figures 5-10) but does not contain structured pseudocode or algorithm blocks describing the Hour-LLaVA model's methodology itself. |
| Open Source Code | Yes | URLs to data and code are attached in the supplementary materials. |
| Open Datasets | Yes | To this end, we introduce VideoMarathon, a large-scale video instruction-following dataset specifically designed for long-form video-language modeling. [...] VideoMarathon integrates five representative public video datasets: Panda-70M [8], Ego4D [17], ActivityNet [4], YouCook2 [72], and MovieChat-1K [47]. [...] URLs to data and code are attached in the supplementary materials. |
| Dataset Splits | Yes | We evaluate our models on four mainstream video-language benchmarks: TempCompass [36], LongVideoBench [58], Video-MME [14], and LVBench [55]. |
| Hardware Specification | Yes | We train Hour-LLaVA-3B with 64 AMD MI250 GPUs and Hour-LLaVA-7B with 64 AMD MI300X GPUs, respectively. |
| Software Dependencies | No | The paper mentions optimizers (AdamW) and LLM decoders (Qwen2.5-3B-Instruct, Qwen2-7B-Instruct) but does not provide specific version numbers for ancillary software dependencies like programming languages or libraries (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | For video-language training, we set the global batch sizes to 128 and 256 for the 3B and 7B models, respectively. A learning rate of 2e-5 is used with a 0.03 warmup ratio under a cosine annealing schedule. The models are optimized using the AdamW [37] optimizer with a cross-entropy loss. [...] Table 8: Detailed training schedule for each training stage of Hour-LLaVA, including compression details, data usage, and training hyperparameters. |
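
The Experiment Setup row reports a 2e-5 learning rate with a 0.03 warmup ratio under a cosine annealing schedule. As a rough illustration of what such a schedule looks like, here is a minimal sketch in plain Python. It is not the authors' code: the function name `lr_at_step`, the linear warmup shape, the decay to zero, the rounding of the warmup length, and the example step count are all assumptions made for illustration, since the quoted text does not specify them.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.03):
    """Return the learning rate for a 0-indexed optimizer step.

    base_lr and warmup_ratio default to the values quoted in the table.
    Everything else (linear warmup, decay to 0) is an assumption.
    """
    # Number of warmup steps; rounding is an illustrative choice.
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Assumed linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Cosine annealing from base_lr toward 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical step count, not taken from the paper
    for s in (0, 15, 29, 30, 500, 999):
        print(s, lr_at_step(s, total))
```

With a hypothetical 1000-step run, the first 30 steps ramp up to the peak rate and the remaining steps follow the cosine curve downward. The actual per-stage step counts and hyperparameters are those listed in the paper's Table 8.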