Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

Authors: Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Max Ehrlich, Tong Lu, Limin Wang, Bryan Catanzaro, Jan Kautz, Andrew Tao, Zhiding Yu, Guilin Liu

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Eagle2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model Eagle2.5-8B achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial model such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B. The paper includes a dedicated "4 Experiments" section, detailing comparisons with state-of-the-art VLMs and ablation studies, presenting results in tables (Tab. 2, Tab. 3, Tab. 4, Tab. 5, Tab. 6, Tab. 7) and figures (Fig. 1, Fig. 6).
Researcher Affiliation	Collaboration	1 Nanjing University, 2 NVIDIA, 3 Hong Kong Polytechnic University
Pseudocode	No	The paper describes its methodologies, including Information-First Sampling and Automatic Degradation Sampling, through detailed textual explanations and mathematical formulations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	No	The paper does not include an explicit statement about releasing the source code for the methodology described, nor does it provide any direct links to a code repository.
Open Datasets	Yes	Our data recipe begins with open-source data. We embrace the diversity first, then quality principle and gather data from various open sources. This data mainly comprises high-definition multi-image/short videos, long videos, multi-page documents, and extensive text data. We also find that current open-source video data often lacks sufficient length. We thus propose a novel dataset, Eagle-Video-110K, to complement the length, as shown in Fig. 4. [...] Combined with short-context data, all collected open-source datasets are summarized in Tab. 1. For convenience, we refer to this collective dataset as Open-Data. Table 1 lists numerous datasets with citations, such as Kinetics710 [9, 101], ActivityNet [8], SlideVQA [89], etc.
Dataset Splits	Yes	As shown in Tab. 2, Eagle2.5-8B demonstrates strong performance across multiple video understanding benchmarks. [...] For Video-MME (w/o subtitle), the performance of Eagle 2.5 (72.4) significantly surpasses models of the same size and is extremely close to the 72B parameter model. On CG-Bench [11], it scores 55.8, 46.6, 45.6, 13.4 across metrics, exceeding Claude-3.5-Sonnet [2] (56.5, 40.3, 35.6, 4.17) and Gemini-1.5-Pro [83] (50.9, 37.8, 28.7, 3.85). With 44.5 on HourVideo [10] dev set and 41.8 on test set, all surpassing Gemini-1.5-Pro [83]. Finally, on Charades-STA [23], Eagle 2.5 outperforms other models significantly, demonstrating strong temporal perception capabilities.
Hardware Specification	Yes	Limitations. The training of Eagle2.5 required substantial computational resources, specifically a cluster of 128 H100 GPUs.
Software Dependencies	No	The paper mentions using specific models like 'Qwen2.5 series models [93]' and 'SigLIP [123]' as foundational components. However, it does not explicitly list specific version numbers for these or any other ancillary software dependencies, such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA) required for replication.
Experiment Setup	Yes	We employ a progressive mixed post-training approach, wherein context length is incrementally expanded during training, enhancing the model's ability to process inputs of varying sizes. [...] In our experiment, we sequentially set L_max to 32K, 64K and 128K. For each benchmark, we sampled at 2FPS, ensuring a maximum of 32 frames.
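
The Experiment Setup entry quotes two concrete settings: a context-length schedule (32K, 64K, 128K) and frame sampling at 2 FPS with at most 32 frames. The sketch below is one possible reading of those settings, not the authors' implementation. The function name `sample_frame_indices`, the constant `CONTEXT_SCHEDULE`, the uniform spreading used to enforce the cap, and the reading of "K" as 1024 tokens are all assumptions introduced for illustration.

```python
# Hypothetical illustration of the quoted setup; not taken from the paper's code.

# Context-length schedule from the quote (assumption: "K" means 1024 tokens).
CONTEXT_SCHEDULE = [32 * 1024, 64 * 1024, 128 * 1024]


def sample_frame_indices(num_frames, native_fps, target_fps=2.0, max_frames=32):
    """Return frame indices sampled at roughly target_fps, capped at max_frames."""
    if num_frames <= 0 or native_fps <= 0:
        return []
    duration_s = num_frames / native_fps
    # Number of frames a target_fps sampler would pick (at least one).
    wanted = max(1, int(duration_s * target_fps))
    # Assumption: when the cap is hit, spread the frames uniformly over the clip.
    count = min(wanted, max_frames)
    step = num_frames / count
    # Take the frame at the centre of each of `count` equal segments.
    return [min(num_frames - 1, int(step * (i + 0.5))) for i in range(count)]


if __name__ == "__main__":
    short_clip = sample_frame_indices(num_frames=300, native_fps=30)   # 10 s clip
    long_clip = sample_frame_indices(num_frames=1800, native_fps=30)   # 60 s clip
    print(len(short_clip), len(long_clip))
```

Under these assumptions, a 10-second clip yields 20 frames (2 FPS), while a 60-second clip is capped at 32 frames spread across the whole video.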