Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models
Authors: Haoran Li, Yingjie Qin, Baoyuan Ou, Lai Xu, Ruiwen Xu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. |
| Researcher Affiliation | Collaboration | 1Carnegie Mellon University 2Xiaohongshu Inc. |
| Pseudocode | No | The paper describes the proposed methods (Hybrid Frequency Allocation Strategy and Dynamic Temporal Scaling Mechanism) in Section 4 using descriptive text and mathematical formulations, but it does not present them in structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/hrlics/HoPE. |
| Open Datasets | Yes | We train the models on a subset of LLaVA-Video-178k [40], which consists of 178k videos ranging from 0 to 3 minutes and 5M instruction samples... For long video understanding, we utilize LongVideoBench [41], Video-MME [42], and MLVU [43], covering videos ranging from a few seconds to 2 hours. For long video retrieval, we employ V-NIAH (Visual Needle-In-A-Haystack) [17]. |
| Dataset Splits | No | The paper states it uses a subset of LLaVA-Video-178k for training, comprising '30k videos with durations under 2 minutes and 3k videos with durations between 2 and 3 minutes'. It also evaluates on various benchmarks. However, specific training, validation, and test splits (e.g., percentages, sample counts, or explicit standard split references) for these datasets are not provided. |
| Hardware Specification | Yes | The entire training process taking approximately 304 GPU hours on machines equipped with H800-80GB GPUs. |
| Software Dependencies | No | The paper mentions using 'Qwen2-1.5B and Qwen2-7B' as backbone models and 'vision encoders from Qwen2-VL-2B/7B-Instruct', but does not specify any software libraries or frameworks with their version numbers (e.g., PyTorch, TensorFlow, Python version, CUDA). |
| Experiment Setup | Yes | During training, we adopt a batch size of 128, a learning rate of 1e-5(2B)/2e-5(7B) with a cosine scheduler. Following the instruction tuning settings in Qwen2-VL [2], we set the maximum video frames to 128 and the video sampling rate to 2. The training context length is set to 8k... During evaluation, the minimum tokens per frame are set to 144. |
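
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object for anyone attempting a reproduction. The following is only a sketch: the class name `TrainConfig`, the helper `make_config`, and every field name are hypothetical and do not come from the paper or its repository. The values are those quoted in the table, except that "8k" is assumed here to mean 8192 tokens.

```python
from dataclasses import dataclass

# Learning rates per model size, as quoted: 1e-5 (2B) / 2e-5 (7B).
LEARNING_RATES = {"2B": 1e-5, "7B": 2e-5}


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the quoted setup; field names are illustrative."""

    model_size: str                        # "2B" or "7B"
    learning_rate: float                   # depends on model size
    batch_size: int = 128
    lr_scheduler: str = "cosine"
    max_video_frames: int = 128
    video_sampling_rate: int = 2           # unit not stated in the quoted text
    train_context_length: int = 8 * 1024   # "8k" in the paper; 8192 is an assumption
    eval_min_tokens_per_frame: int = 144


def make_config(model_size: str) -> TrainConfig:
    """Build a config for one of the two model sizes named in the table."""
    if model_size not in LEARNING_RATES:
        raise ValueError(f"unknown model size: {model_size!r}")
    return TrainConfig(model_size=model_size,
                       learning_rate=LEARNING_RATES[model_size])


if __name__ == "__main__":
    for size in ("2B", "7B"):
        print(make_config(size))
```

Keeping the size-dependent learning rate in a lookup table, separate from the shared defaults, makes it clear which settings differ between the two models and which are common to both.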