Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention

Authors: Haijing Liu, Zhiyuan Song, Hefeng Wu, Tao Pu, Keze Wang, Liang Lin

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that CERES achieves state-of-the-art performance on Ego-RVOS benchmarks, highlighting the potential of applying causal reasoning to build more reliable models for broader egocentric video understanding.
Researcher Affiliation | Academia | Haijing Liu, Zhiyuan Song, Hefeng Wu, Tao Pu, Keze Wang, Liang Lin; Sun Yat-sen University, Guangzhou 510006, China
Pseudocode | No | The paper describes the methodology using equations and prose, for example, Equations (1) through (11), but no explicitly labeled 'Pseudocode' or 'Algorithm' block is present.
Open Source Code | No | We will release our code and detailed instructions for data preparation and reproducing the main experimental results after the review period.
Open Datasets | Yes | Following previous work [36], we evaluate our method on three public egocentric video datasets: VISOR [12], VOST [46], and VSCOS [62].
Dataset Splits | Yes | VISOR, derived from EPIC-KITCHENS [10, 11], provides annotations for hands and active object interactions; we utilize its training and validation splits. After preprocessing, this yields 13,205 videos (76,873 objects) for training and 467 videos (1,841 objects) for validation, where validation objects are manually annotated as positive or negative. VOST and VSCOS are used for validation only.
Hardware Specification | Yes | All experiments were conducted using PyTorch 2.1.2 and CUDA 11.8 on a system with four NVIDIA V100 GPUs.
Software Dependencies | Yes | All experiments were conducted using PyTorch 2.1.2 and CUDA 11.8 on a system with four NVIDIA V100 GPUs.
Experiment Setup | Yes | Models were trained for 6 epochs with a total batch size of 4, where each batch item was a single video clip. We initialized the learning rate to 1 × 10^-3 for our CERES modules and 1 × 10^-4 for pre-trained components, decaying it by 0.1 at epochs 3 and 5, using the AdamW optimizer. The primary segmentation loss combined bounding box, Dice and Focal losses. Input frames are resized to 448 × 448.
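
The learning-rate schedule quoted in the Experiment Setup entry can be illustrated with a minimal sketch. It is not the authors' code (which was unreleased at the time of classification). It assumes that epochs are counted from 1 and that the 0.1 decay takes effect at the start of epochs 3 and 5; the quoted text does not state either convention. The names `step_decay_lr` and `schedule` are hypothetical helpers introduced only for this illustration.

```python
# Sketch of a step-decay schedule for two parameter groups.
# Assumptions (not stated in the source): epochs are 1-indexed and the
# decay is applied at the start of each milestone epoch.

def step_decay_lr(base_lr, epoch, milestones=(3, 5), gamma=0.1):
    """Return the learning rate in effect during `epoch`."""
    # Count how many milestones have been reached by this epoch.
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


def schedule(num_epochs=6, new_module_lr=1e-3, pretrained_lr=1e-4):
    """List (epoch, lr for new modules, lr for pre-trained components)."""
    return [
        (epoch,
         step_decay_lr(new_module_lr, epoch),
         step_decay_lr(pretrained_lr, epoch))
        for epoch in range(1, num_epochs + 1)
    ]


if __name__ == "__main__":
    for epoch, lr_new, lr_pre in schedule():
        print(f"epoch {epoch}: new modules {lr_new:.0e}, pre-trained {lr_pre:.0e}")
```

Under these assumptions both groups keep their initial rates for epochs 1 and 2, drop by a factor of 10 for epochs 3 and 4, and drop by another factor of 10 for epochs 5 and 6. In PyTorch the same behavior is usually obtained with two optimizer parameter groups and a multi-step scheduler, though the exact configuration used in the paper is not given in the quoted text.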