Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

OpenMMEgo: Enhancing Egocentric Understanding for LMMs with Open Weights and Data

Authors: Hao Luo, Zihao Yue, Wanpeng Zhang, Yicheng Feng, Sipeng Zheng, Deheng Ye, Zongqing Lu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | OpenMMEgo consistently improves the performance of LMMs on egocentric benchmarks without sacrificing general video understanding performance. Notably, Qwen2.5-VL tuned with OpenMMEgo substantially outperforms other models of the same size in egocentric video understanding. We conduct extensive evaluations on SOTA video LMMs (Li et al., 2024a; Bai et al., 2025). Our results show that OpenMMEgo consistently enhances egocentric comprehension while maintaining general video understanding capabilities.
Researcher Affiliation | Collaboration | 1 Peking University; 2 BeingBeyond; 3 Renmin University of China; 4 Tencent
Pseudocode | No | The paper describes the steps for Dual Semantic-aware Token Compression with numbered steps (e.g., 'Step 1: Calculate the similarities...'), but these are descriptive explanations within the text and diagrams (Figure 2) rather than formal pseudocode blocks or algorithms.
Open Source Code | Yes | The data, weights and training code will be put at https://github.com/BeingBeyond/OpenMMEgo.
Open Datasets | Yes | To provide rich spatiotemporal visual knowledge, we curate a large-scale, high-quality dataset named OME10M, comprising over 8.2M egocentric video QA pairs synthesized from Ego4D series. We also establish OMEBench, a comprehensive benchmark for rigorous egocentric understanding assessment. The data, weights and training code will be put at https://github.com/BeingBeyond/OpenMMEgo.
Dataset Splits | Yes | In our offline curriculum learning, we quantify the difficulty of each data via the loss value on capable LMMs... The filtered dataset is divided into three subsets in equal size, easy, medium, and hard. We then organise the datasets for a three-stage training, with the overall difficulty increasing as the stages progress. The specific recipe for each stage is presented in Table 6. Table 6: Data partition for each training stage. Stage-1: 60% Easy, 30% Medium, 10% Hard. Stage-2: 35% Easy, 50% Medium, 15% Hard. Stage-3: 5% Easy, 20% Medium, 75% Hard.
Hardware Specification | Yes | We train both variants for 1 epoch with a global batch size of 128 across 128 NVIDIA A800 GPUs.
Software Dependencies | Yes | As for the based models, we use the version of LLaVA-Video-7B-Qwen2 and Qwen2.5-VL-7B-Instruct respectively. Numerical Precision: bfloat16.
Experiment Setup | Yes | To evaluate the effectiveness of OpenMMEgo, we apply it to two state-of-the-art 7B video MLLMs... For visual token compression, we set k = 35 for STM and r_l = 35 and r_h = 95 for TTP... The hyperparameter α for online in-batch data dropout is set to 0.3. In our implementation, each video is processed as up to 192 frames (resized 384 × 384), with a maximum visual token context length of N = 13,440. Following the training framework of LLaVA-Next, we train both variants for 1 epoch with a global batch size of 128 across 128 NVIDIA A800 GPUs. Table 5: Training hyperparameters for instruction tuning. Global Batch Size: 128; Frame Number: 192; Input Resolution: 384; Learning Rate: 1e-5; Weight Decay: 0; Warmup Ratio: 0.03; Learning Rate Scheduler: cosine; Numerical Precision: bfloat16; Epochs: 1; Max Sequence Length: 32768; Max Visual Context Length: 13440.
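
Illustrative sketch (not from the paper): the Dataset Splits row quotes a three-way difficulty split and the Table 6 stage recipe, but not the procedure that assigns samples to stages. In the quoted percentages, each stage sums to 100% and so does each difficulty level across the three stages. With equal-size subsets, one consistent reading is therefore that every subset is partitioned across the stages without overlap, which gives three equal-size stages whose composition matches Table 6. The Python below follows that reading as an assumption. The names STAGE_RECIPE, split_by_difficulty and build_stages are hypothetical, and the per-sample losses are taken as precomputed.

```python
import random

# Percent of each difficulty subset assigned to each stage (Table 6 as quoted above).
STAGE_RECIPE = {
    "stage-1": {"easy": 60, "medium": 30, "hard": 10},
    "stage-2": {"easy": 35, "medium": 50, "hard": 15},
    "stage-3": {"easy": 5, "medium": 20, "hard": 75},
}


def split_by_difficulty(losses):
    """Rank sample ids by ascending loss and cut them into three equal-size subsets.

    `losses` maps sample id -> loss from a reference model (assumed precomputed).
    Up to two leftover samples are dropped so the subsets stay equal in size.
    """
    ranked = sorted(losses, key=losses.get)
    n = len(ranked) // 3
    return {
        "easy": ranked[:n],
        "medium": ranked[n:2 * n],
        "hard": ranked[2 * n:3 * n],
    }


def build_stages(buckets, recipe=STAGE_RECIPE, seed=0):
    """Partition each subset across the stages according to `recipe`.

    Assumption: every sample is used exactly once over the three stages.
    """
    rng = random.Random(seed)
    stages = {name: [] for name in recipe}
    for level, ids in buckets.items():
        ids = list(ids)
        rng.shuffle(ids)  # random assignment within a difficulty level
        start, cumulative = 0, 0
        for name in recipe:
            cumulative += recipe[name][level]
            end = len(ids) * cumulative // 100  # integer cut point
            stages[name].extend(ids[start:end])
            start = end
    for samples in stages.values():
        rng.shuffle(samples)  # mix difficulty levels within a stage
    return stages


if __name__ == "__main__":
    # Toy losses: sample i has loss i, so ids 0-99 are "easy", 200-299 "hard".
    toy_losses = {i: float(i) for i in range(300)}
    toy_stages = build_stages(split_by_difficulty(toy_losses))
    for name, samples in toy_stages.items():
        print(name, len(samples))
```

If the authors instead resample each stage independently from the three subsets, the stage compositions would still match Table 6, but individual samples could repeat across stages or go unused.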