Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

Authors: Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, Fei Wu, Jiangmiao Pang, Yu Qiao

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks. Full code and data are released at https://github.com/InternRobotics/EgoThinker.
Researcher Affiliation	Collaboration	1Shanghai Artificial Intelligence Laboratory, 2Zhejiang University, 3The University of Tokyo, 4Fudan University, 5Nanjing University EMAIL; EMAIL
Pseudocode	No	The paper describes methods and filtering pipelines (e.g., Section 3.1.1 'Egocentric Video Collection' and Figure 2 'Data Filtering Pipeline'), and discusses algorithms like GRPO (Section 3.2.2), but does not present any of these in a structured pseudocode or algorithm block format.
Open Source Code	Yes	Full code and data are released at https://github.com/InternRobotics/EgoThinker.
Open Datasets	Yes	First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. ... Full code and data are released at https://github.com/InternRobotics/EgoThinker.
Dataset Splits	Yes	To balance training efficiency and model performance, we carefully curated our training dataset as shown in Table 1. Because using the complete dataset would lead to prohibitive computational costs and performance degradation due to data imbalance, we filter each dataset: for the video caption dataset, we select 170K samples in total; for the ego-related dataset, we select 390K QA samples in total; for our EgoRe-5M, we select 810K samples, including 410K from short-term splits, 400K from long-term split and 50K from CoT split.
Hardware Specification	Yes	We utilize 32 A100 GPUs and train for 30 hours. For RFT, we adopt lr = 1e-5, epoch = 1 for training. We utilize 8 A100 GPUs and train for 12 hours.
Software Dependencies	No	The paper mentions several models and frameworks like 'Qwen2-VL-7B', 'InternVideo backbone', 'DeepSeek-V3', 'VideoChat2-HD', 'DeepSeek-R1', and 'GRPO'. However, it does not specify software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA x.x) necessary to replicate the experiments.
Experiment Setup	Yes	For SFT, we adopt max pixels = 200704, min pixels = 3136, lr = 1e-6, epoch = 1 for training. We utilize 32 A100 GPUs and train for 30 hours. For RFT, we adopt lr = 1e-5, epoch = 1 for training. We utilize 8 A100 GPUs and train for 12 hours.
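
The setup reported in the Experiment Setup and Hardware Specification rows can be restated as a small configuration sketch, which also makes the implied compute budget explicit. This is only an illustration: the class name `StageConfig`, its field names, and the helper `total_gpu_hours` are hypothetical and do not come from the paper or its repository. The numeric values are the ones quoted above, and GPU-hours are simply the GPU count multiplied by the reported wall-clock hours.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container for one training stage; field names are illustrative."""
    name: str
    learning_rate: float
    epochs: int
    num_gpus: int                      # A100 GPUs, as reported
    wall_clock_hours: float            # training time, as reported
    max_pixels: Optional[int] = None   # only reported for SFT
    min_pixels: Optional[int] = None   # only reported for SFT

    @property
    def gpu_hours(self) -> float:
        # GPU-hours = number of GPUs x wall-clock hours
        return self.num_gpus * self.wall_clock_hours


# Values quoted in the table above.
SFT = StageConfig("SFT", learning_rate=1e-6, epochs=1, num_gpus=32,
                  wall_clock_hours=30, max_pixels=200704, min_pixels=3136)
RFT = StageConfig("RFT", learning_rate=1e-5, epochs=1, num_gpus=8,
                  wall_clock_hours=12)


def total_gpu_hours(stages: Iterable[StageConfig]) -> float:
    """Sum GPU-hours over all stages."""
    return sum(stage.gpu_hours for stage in stages)


if __name__ == "__main__":
    for stage in (SFT, RFT):
        print(f"{stage.name}: lr={stage.learning_rate}, "
              f"{stage.num_gpus} GPUs x {stage.wall_clock_hours} h "
              f"= {stage.gpu_hours} GPU-hours")
    print(f"Total: {total_gpu_hours((SFT, RFT))} A100 GPU-hours")
```

Under these reported figures, SFT accounts for 960 A100 GPU-hours and RFT for 96, or 1056 in total. Settings the paper does not report (batch size, optimizer, software versions) are deliberately left out of the sketch.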