Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that VideoRoPE consistently achieves superior performance compared to other RoPE variants. For example, VideoRoPE outperforms previous M-RoPE on long video retrieval (+12.4 on V-NIAH, +12.4 on V-NIAH-D), video understanding (+2.9 on LongVideoBench, +4.5 on MLVU, +1.7 on Video-MME) and hallucination (+11.9 on VideoHallucer) benchmarks. ... Section 5. Experiment |
| Researcher Affiliation | Academia | 1Fudan University, Shanghai, China 2Shanghai AI Laboratory, Shanghai, China 3Shanghai Innovation Institute, Shanghai, China 4The Chinese University of Hong Kong 5CPII under InnoHK. Correspondence to: Yuhang Zang <EMAIL>, Qipeng Guo <EMAIL>, Jiaqi Wang <EMAIL>. |
| Pseudocode | No | The paper includes mathematical equations (Eq. 1-7) describing the model components and their interactions, but no clearly labeled pseudocode blocks or algorithms are present. |
| Open Source Code | Yes | Our code is available at https://github.com/Wiselnn570/VideoRoPE. |
| Open Datasets | Yes | We use a subset of LLaVA-Video-178k dataset (Zhang et al., 2024e) to train VideoRoPE. ... We evaluate our approach using six video benchmarks, including tasks related to long video understanding, long video retrieval, and video hallucination. For long video understanding, we use LongVideoBench (Wu et al., 2024a) (8 seconds to 1 hour), MLVU (Zhou et al., 2024) (3 minutes to 2 hours), and Video-MME (Fu et al., 2024) (11 seconds to 60 minutes). For long video retrieval, we use Vision Needle-in-a-Haystack (V-NIAH) (Zhang et al., 2024d) and our proposed extension, Vision Needle-in-a-Haystack with Distractors (V-NIAH-D)... For video hallucination, we use VideoHallucer (Wang et al., 2024d). |
| Dataset Splits | No | We use a subset of LLaVA-Video-178k dataset (Zhang et al., 2024e) to train VideoRoPE. The LLaVA-Video-178k dataset covers 178k videos and around 5 million question-answers (QA) pairs from diverse sources such as HD-VILA (Xue et al., 2022), Kinetics (Kay et al., 2017), and ActivityNet (Fabian Caba Heilbron & Niebles, 2015). To balance training efficiency and long-video comprehension, we randomly select 136k videos with durations under 2 minutes and 18k videos with durations between 2 and 3 minutes. This process yielded our training set of approximately 1.3 million pairs. The paper describes the construction of a training set and refers to various benchmarks for evaluation, but does not provide specific training/validation/test splits for the LLaVA-Video-178k subset or explicitly detail how splits were managed for the evaluation benchmarks beyond using them for evaluation. |
| Hardware Specification | Yes | Our fine-tuning process employs a batch size of 128, a cosine scheduler with a learning rate of 1e-5, a warm-up ratio of 1e-2, and 704 Nvidia-A100 GPU hours in total. |
| Software Dependencies | No | All models are initialized with the Vision Transformer from Qwen2-VL-7B and LLM (Vanilla RoPE) from Qwen2-7B (Yang et al., 2024a). Our fine-tuning incorporates our VideoRoPE to process the spatiotemporal nature of the video data effectively. We adopt Qwen2-VL's finetuning settings... We use the vLLM framework (Kwon et al., 2023) to support inference on sequences longer than 32k tokens. The paper mentions specific models and a framework but does not provide version numbers for underlying software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Our fine-tuning process employs a batch size of 128, a cosine scheduler with a learning rate of 1e-5, a warm-up ratio of 1e-2, and 704 Nvidia-A100 GPU hours in total. The evaluation involves sampling videos at 2 fps with a minimum of 144 image tokens per frame. We use the vLLM framework (Kwon et al., 2023) to support inference on sequences longer than 32k tokens. ... We adopt Qwen2-VL's finetuning settings, processing each video at 2 fps with a maximum of 128 frames and dynamically adjusting the image resolution to maintain a consistent token count. However, to prevent memory overflow, we use a context window of 8192 tokens. |
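All RoPE variants compared above (Vanilla RoPE, M-RoPE, VideoRoPE) build on the standard 1-D rotary position embedding. As background for the scores, here is a minimal NumPy sketch of that base mechanism; it is not the paper's 3-D spatiotemporal VideoRoPE allocation, and the function name and shapes are illustrative assumptions.

```python
import numpy as np

def rotary_embed(x, positions, base=10000.0):
    """Standard 1-D rotary position embedding (RoPE) sketch.

    x: (seq_len, dim) array with even dim; positions: (seq_len,) positions.
    Channel pair (2i, 2i+1) is rotated by angle pos / base**(2i/dim), so the
    dot product of a rotated query and key depends only on their relative
    position.
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "RoPE rotates channel pairs, so dim must be even"
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    angles = np.outer(positions, inv_freq)                   # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin  # 2-D rotation of each channel pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is rotated (not scaled), vector norms are preserved, and query/key inner products are invariant to shifting both positions by the same offset; the multimodal variants differ chiefly in how they assign positions to the temporal and spatial axes of video tokens.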