Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that VideoRoPE consistently achieves superior performance compared to other RoPE variants. For example, VideoRoPE outperforms previous M-RoPE on long video retrieval (+12.4 on V-NIAH, +12.4 on V-NIAH-D), video understanding (+2.9 on LongVideoBench, +4.5 on MLVU, +1.7 on Video-MME) and hallucination (+11.9 on VideoHallucer) benchmarks. ... Section 5. Experiment |
| Researcher Affiliation | Academia | 1Fudan University, Shanghai, China 2Shanghai AI Laboratory, Shanghai, China 3Shanghai Innovation Institute, Shanghai, China 4The Chinese University of Hong Kong 5CPII under InnoHK. Correspondence to: Yuhang Zang <EMAIL>, Qipeng Guo <EMAIL>, Jiaqi Wang <EMAIL>. |
| Pseudocode | No | The paper includes mathematical equations (Eq. 1-7) describing the model components and their interactions, but no clearly labeled pseudocode blocks or algorithms are present. |
| Open Source Code | Yes | Our code is available at https://github.com/Wiselnn570/VideoRoPE. |
| Open Datasets | Yes | We use a subset of LLaVA-Video-178k dataset (Zhang et al., 2024e) to train VideoRoPE. ... We evaluate our approach using six video benchmarks, including tasks related to long video understanding, long video retrieval, and video hallucination. For long video understanding, we use LongVideoBench (Wu et al., 2024a) (8 seconds to 1 hour), MLVU (Zhou et al., 2024) (3 minutes to 2 hours), and Video-MME (Fu et al., 2024) (11 seconds to 60 minutes). For long video retrieval, we use Vision Needle-in-a-Haystack (V-NIAH) (Zhang et al., 2024d) and our proposed extension, Vision Needle-in-a-Haystack with Distractors (V-NIAH-D)... For video hallucination, we use VideoHallucer (Wang et al., 2024d). |
| Dataset Splits | No | We use a subset of LLaVA-Video-178k dataset (Zhang et al., 2024e) to train VideoRoPE. The LLaVA-Video-178k dataset covers 178k videos and around 5 million question-answers (QA) pairs from diverse sources such as HD-VILA (Xue et al., 2022), Kinetics (Kay et al., 2017), and ActivityNet (Fabian Caba Heilbron & Niebles, 2015). To balance training efficiency and long-video comprehension, we randomly select 136k videos with durations under 2 minutes and 18k videos with durations between 2 and 3 minutes. This process yielded our training set of approximately 1.3 million pairs. The paper describes the construction of a training set and refers to various benchmarks for evaluation, but does not provide specific training/validation/test splits for the LLaVA-Video-178k subset or explicitly detail how splits were managed for the evaluation benchmarks beyond using them for evaluation. |
| Hardware Specification | Yes | Our fine-tuning process employs a batch size of 128, a cosine scheduler with a learning rate of 1e-5, a warm-up ratio of 1e-2, and 704 Nvidia-A100 GPU hours in total. |
| Software Dependencies | No | All models are initialized with the Vision Transformer from Qwen2-VL-7B and LLM (Vanilla RoPE) from Qwen2-7B (Yang et al., 2024a). Our fine-tuning incorporates our VideoRoPE to process the spatiotemporal nature of the video data effectively. We adopt Qwen2-VL's finetuning settings... We use the vLLM framework (Kwon et al., 2023) to support inference on sequences longer than 32k tokens. The paper mentions specific models and a framework but does not provide version numbers for underlying software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Our fine-tuning process employs a batch size of 128, a cosine scheduler with a learning rate of 1e-5, a warm-up ratio of 1e-2, and 704 Nvidia-A100 GPU hours in total. The evaluation involves sampling videos at 2 fps with a minimum of 144 image tokens per frame. We use the vLLM framework (Kwon et al., 2023) to support inference on sequences longer than 32k tokens. ... We adopt Qwen2-VL's finetuning settings, processing each video at 2 fps with a maximum of 128 frames and dynamically adjusting the image resolution to maintain a consistent token count. However, to prevent memory overflow, we use a context window of 8192 tokens. |
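All RoPE variants compared above (Vanilla RoPE, M-RoPE, VideoRoPE) build on the standard 1-D rotary position embedding. As background for the scores, here is a minimal NumPy sketch of that base mechanism; it is not the paper's 3-D spatiotemporal VideoRoPE allocation, and the function name and shapes are illustrative assumptions.

```python
import numpy as np

def rotary_embed(x, positions, base=10000.0):
    """Standard 1-D rotary position embedding (RoPE) sketch.

    x: (seq_len, dim) array with even dim; positions: (seq_len,) positions.
    Channel pair (2i, 2i+1) is rotated by angle pos / base**(2i/dim), so the
    dot product of a rotated query and key depends only on their relative
    position.
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "RoPE rotates channel pairs, so dim must be even"
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    angles = np.outer(positions, inv_freq)                   # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin  # 2-D rotation of each channel pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is rotated (not scaled), vector norms are preserved, and query/key inner products are invariant to shifting both positions by the same offset; the multimodal variants differ chiefly in how they assign positions to the temporal and spatial axes of video tokens.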