Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras

Authors: Lingdong Kong, Dongyue Lu, Alan Liang, Rong Li, Yuhao Dong, Tianshuai Hu, Lai Xing Ng, Wei Ooi, Benoit Cottereau

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. Built from real-world driving data, we provide over 30,000 validated referring expressions, each enriched with four grounding attributes… To fully exploit these cues, we propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE). Our method adapts to different modalities and scene dynamics, achieving consistent gains over state-of-the-art baselines in event-only, frame-only, and event-frame fusion settings. 5 Experiments, 5.1 Experimental Settings. Baselines & Competitors. We benchmark EventRefer against three groups of methods. (1) Frame-Only: we retrain traditional visual grounding methods [41, 37] on Talk2Event and report zero-shot results from the large-scale generalist models [67, 58, 16]. (2) Event-Only: as no event-based grounding method exists, we adapt leading event perception methods [30, 92, 72, 118, 83] by attaching a DETR Transformer and a grounding head. (3) Event-Frame Fusion: we re-implement leading event-frame fusion perception methods [27, 8, 111, 63] under the same DETR Transformer and grounding head.
Researcher Affiliation | Academia | (1) NUS; (2) CNRS@CREATE; (3) HKUST(GZ); (4) NTU; (5) HKUST; (6) I2R, A*STAR; (7) IPAL, CNRS IRL 2955, Singapore; (8) CerCo, CNRS UMR 5549, Université Toulouse III
Pseudocode | No | The paper describes the methodology using text and diagrams (Figure 3, for example), but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code & Dataset: talk2event/toolkit
Open Datasets | Yes | We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. Built from real-world driving data, we provide over 30,000 validated referring expressions... The Talk2Event dataset is released under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. E.1 Public Datasets Used: DSEC (https://dsec.ifi.uzh.ch), CC BY-SA 4.0 License.
Dataset Splits | Yes | The dataset is partitioned into a training split of 4,433 scenes and a test split of 1,134 scenes. These scenes span a wide temporal range, with each sequence providing both high-speed and low-light conditions, ensuring robustness to dynamic and challenging scenarios. In total, we annotate 13,458 unique object instances, with each object grounded via three distinct referring expressions, yielding 30,690 validated captions. All models are evaluated using the same test split of Talk2Event. The test split contains 1,134 unique scenes and 3,137 objects, covering all 7 traffic-related categories with varying density and occlusion levels.
Hardware Specification | Yes | All models in our benchmark, including EventRefer and the baselines, are implemented using PyTorch [71] and trained on NVIDIA RTX A6000 GPUs.
Software Dependencies | No | We use the RoBERTa-base [61] tokenizer and encoder for all methods that process natural language inputs... All models in our benchmark, including EventRefer and the baselines, are implemented using PyTorch [71] and trained on NVIDIA RTX A6000 GPUs. The paper mentions software like PyTorch and RoBERTa but does not specify their version numbers.
Experiment Setup | Yes | We use AdamW [62] as the optimizer with a weight decay of 0.01. Learning rates are as follows: $1 \times 10^{-6}$ for the frame backbone, $5 \times 10^{-6}$ for the RoBERTa text encoder, $5 \times 10^{-5}$ for the event backbone, fusion module, and transformer layers. All models are trained with a batch size of 16 and warm-up for the first 500 steps, followed by cosine decay. Training is conducted for 90K steps on the training split of Talk2Event, and models are selected using the best mAcc on a held-out validation split.
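
The learning-rate schedule quoted in the Experiment Setup row can be illustrated with a short dependency-free Python sketch. This is not the authors' code. The three base learning rates, the 500 warm-up steps, and the 90K total steps come from the row above; the group names, the linear shape of the warm-up, and the decay reaching zero at the final step are assumptions made here for illustration.

```python
import math

# Values quoted in the Experiment Setup row; the dictionary keys are
# hypothetical names chosen for this sketch.
BASE_LRS = {
    "frame_backbone": 1e-6,
    "text_encoder": 5e-6,
    "event_backbone_fusion_transformer": 5e-5,
}
WARMUP_STEPS = 500
TOTAL_STEPS = 90_000


def lr_scale(step: int) -> float:
    """Multiplier applied to every base learning rate at a given step.

    Assumptions (not stated in the source): the warm-up is linear from 0,
    and the cosine decay ends at 0 when step == TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)  # clamp in case step exceeds TOTAL_STEPS
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def learning_rates(step: int) -> dict:
    """Per-group learning rates at a given step."""
    scale = lr_scale(step)
    return {name: base * scale for name, base in BASE_LRS.items()}


if __name__ == "__main__":
    for step in (0, 250, 500, 45_250, 90_000):
        print(step, learning_rates(step))
```

In a PyTorch setup, a function like `lr_scale` could plausibly be passed as the `lr_lambda` argument of `torch.optim.lr_scheduler.LambdaLR`, wrapping a `torch.optim.AdamW` optimizer built with one parameter group per learning rate and `weight_decay=0.01`; whether the paper's implementation is organized this way is not stated in the source.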