Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining
Authors: Boshen Xu, Yuting Mei, Xinbi Liu, Sipeng Zheng, Qin Jin
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate EgoDTM's superior performance across diverse downstream tasks, highlighting its superior 3D-aware visual understanding. Code: https://github.com/xuboshen/EgoDTM. |
| Researcher Affiliation | Collaboration | Boshen Xu¹, Yuting Mei¹, Xinbi Liu¹, Sipeng Zheng², Qin Jin¹. ¹AIM3 Lab, Renmin University of China; ²Being Beyond |
| Pseudocode | No | The paper describes methods in text and with architectural diagrams (e.g., Figure 2) but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/xuboshen/EgoDTM. |
| Open Datasets | Yes | With the emergence of large-scale egocentric datasets [22], video-language pretraining [36, 87] has become a dominant paradigm for learning egocentric video representations, significantly improving performance on downstream tasks such as video-text retrieval [11, 67] and action recognition [66, 34]. Ego4D [22]. Epic-Kitchens [11]. EGTEA [34]. H2O [29]. |
| Dataset Splits | Yes | Our pretraining data consists of four million (video, text) pairs, with each video approximately 1 second long. For the natural language query task, it comprises 1,659 untrimmed videos, each averaging 500 seconds in duration. Following the official split from [22], we use 11,291 queries for training and 3,874 for validation. Epic-Kitchens-100 (EK-100) consists of 100 hours of egocentric cooking videos divided into training (67,217 clips), validation (9,668 clips), and testing (13,092 clips) splits. H2O... the train/val splits have 7862/11638 frames. For our experiments on EGTEA, we use only the visual frames as input. We follow prior works [27, 87] and report top-1 accuracy and mean class accuracy on all three test splits, including 2,022 testing instances for each split. |
| Hardware Specification | Yes | EgoDTM is then trained for two epochs on 8*A800 GPUs, which requires approximately 10 hours and a learning rate of 3e-5. |
| Software Dependencies | No | The paper mentions several foundation models and tools such as DINOv2, SAM2, DeepSeek-LLM, Faster-RCNN, and VSLNet, but it does not specify version numbers for general software dependencies such as programming languages or libraries (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | EgoDTM is then trained for two epochs on 8*A800 GPUs, which requires approximately 10 hours and a learning rate of 3e-5. The hidden dimension of the dual encoders is 768, while the 3D-aware decoder uses a dimension of 256 for efficient design. We use frames with 224p as input and 56p as output of the depth maps. Consequently, our 3D-aware decoder only has 9M parameters, and the batch size is set to 4096. We perform video-text matching with 16 frames as input for EK100MIR and 4 frames for EgoMCQ, following [86]. During inference, we apply three spatial crops of size 224 × 224 from each 256 × 256 frame of 10 video clips, averaging predictions across these crops to produce the final results. The model is trained for 10 epochs with a batch size of 64 and a learning rate of 0.0005, where the first 1.5 epochs serve as a warm-up phase. |
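
The inference procedure quoted in the Experiment Setup row (three 224 × 224 spatial crops from each 256 × 256 frame, with predictions averaged across crops) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `three_crop_boxes` and `average_predictions` are hypothetical, and the crop placement (left, center, and right along the horizontal axis, vertically centered) is an assumption, since the quoted text does not say where the crops are taken.

```python
from statistics import fmean


def three_crop_boxes(frame_size=256, crop_size=224):
    """Return (top, left, bottom, right) boxes for three spatial crops.

    Assumption: crops sit at the left, center, and right of the frame and are
    vertically centered. The quoted setup gives only the sizes (224 and 256).
    """
    margin = frame_size - crop_size
    top = margin // 2
    lefts = [0, margin // 2, margin]
    return [(top, left, top + crop_size, left + crop_size) for left in lefts]


def average_predictions(per_crop_scores):
    """Average class scores element-wise across crops.

    per_crop_scores holds one list of class scores per crop.
    """
    return [fmean(scores) for scores in zip(*per_crop_scores)]


# Hypothetical usage with made-up scores for two classes.
boxes = three_crop_boxes()
final_scores = average_predictions([[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]])
```

With the default sizes, the margin is 32 pixels, so the three boxes are offset horizontally by 0, 16, and 32 pixels. The quoted setup also averages over 10 clips per video; the same `average_predictions` helper could be applied across clips under that reading.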