Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Rethinking Scale-Aware Temporal Encoding for Event-based Object Detection
Authors: Lin Zhu, Tengyu Long, Xiao Wang, Lizhi Wang, Hua Huang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the Gen1, 1 Mpx and eTraM datasets demonstrate that our approach achieves superior accuracy over recent transformer-based models, highlighting the importance of precise temporal feature extraction in early stages. |
| Researcher Affiliation | Academia | 1 School of Computer Science, Beijing Institute of Technology; 2 School of Computer Science and Technology, Anhui University; 3 School of Artificial Intelligence, Beijing Normal University |
| Pseudocode | No | The paper describes methods and architectures but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/BIT-Vision/SATE |
| Open Datasets | Yes | Experiments on the Gen1, 1 Mpx and eTraM benchmarks demonstrate that our approach achieves state-of-the-art performance, outperforming recent transformer-based and SNN-based methods, validating the effectiveness of early-stage temporal modeling. |
| Dataset Splits | Yes | The Gen1 automotive dataset [5] consists of 39 hours of event camera recordings with a resolution of 304 × 240. It includes 228k annotated bounding boxes for vehicles and 28k for pedestrians, with available annotation frequencies of 1, 2, or 4 Hz. Following the evaluation protocol of previous works [31, 22, 11], we discard bounding boxes with side lengths smaller than 10 pixels and diagonal lengths shorter than 30 pixels. Similarly, the 1 Mpx dataset [31] focuses on driving scenarios but provides several months of higher-resolution (1280 × 720) daytime and nighttime recordings. It contains approximately 15 hours of event data, annotated at 30 or 60 Hz, with around 25 million bounding box labels distributed across three categories: vehicles, pedestrians, and two-wheelers. We adhere to the same evaluation protocol, removing bounding boxes with side lengths smaller than 20 pixels and diagonal lengths shorter than 60 pixels, and downsample the input resolution to 640 × 360. Unlike the Gen1 [5] and 1 Mpx [31] datasets, the eTraM dataset [37] is a traffic monitoring dataset collected from a roadside perspective, thus exhibiting higher sparsity. eTraM contains approximately 10 hours of data with a resolution of 1280 × 720, including around 2 million annotated bounding boxes across 8 categories, with annotations provided at 30 Hz. The preprocessing procedure of the eTraM dataset is similar to that of the 1 Mpx dataset. For all datasets, mean Average Precision (mAP) [23] is considered as the primary metric. |
| Hardware Specification | Yes | The training takes approximately 4 days on a single RTX 3090 GPU. On the 1 Mpx dataset, we train with a batch size of 8, sequence length of 5, and learning rate of 3e-4 for 800k iterations on a single RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions software components like the ADAM optimizer, OneCycle learning rate scheduler, YOLOv6, ConvLSTM, and Feature Pyramid Network, but does not provide specific version numbers for these or other key software libraries like PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | During training, we adopt the ADAM optimizer [19] along with a OneCycle [36] learning rate scheduler, which linearly decays from its peak value. Following the strategy in RVT [11], we employ a mixed batch training technique, where standard Backpropagation Through Time (BPTT) is applied to half of the batch samples, while Truncated BPTT (TBPTT) is applied to the other half. Data augmentation includes random horizontal flipping, zoom-in, and zoom-out operations. The event representation is constructed as a 5-channel voxel grid [45] based on a 50 ms time window. For the detection head, we utilize a Feature Pyramid Network (FPN) [24] for multi-scale feature fusion, along with the detection head from YOLOv6 [21], which incorporates distribution focal loss, classification loss, and regression loss. To compare against prior works on the Gen1 dataset, we train our models with a batch size of 6, sequence length of 21, learning rate of 2e-4 for 400k iterations. On the 1 Mpx dataset, we train with a batch size of 8, sequence length of 5, and learning rate of 3e-4 for 800k iterations on a single RTX 3090 GPU. On the eTraM dataset, the model is trained for 400k iterations with a batch size of 8, a sequence length of 5, and an initial learning rate of 3e-4. |
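
The bounding-box filtering quoted in the Dataset Splits row can be illustrated with a minimal sketch. This is not the authors' code: the `Box` type, the `THRESHOLDS` keys, and the function names are hypothetical, and the sketch assumes a box is discarded when either a side or the diagonal falls below its threshold. The quoted text does not say whether the 1 Mpx thresholds apply before or after downsampling, and it gives no explicit values for eTraM, so neither is encoded here.

```python
import math
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixels (hypothetical layout: top-left x, y, width, height)."""
    x: float
    y: float
    w: float
    h: float


# (minimum side, minimum diagonal) in pixels, as quoted in the table above.
# The dictionary keys are illustrative labels, not identifiers from the paper's code.
THRESHOLDS = {"gen1": (10, 30), "1mpx": (20, 60)}


def keep_box(box: Box, min_side: float, min_diag: float) -> bool:
    """Return True if both sides and the diagonal meet the thresholds.

    Assumption: a box failing either test is discarded.
    """
    diag = math.hypot(box.w, box.h)
    return box.w >= min_side and box.h >= min_side and diag >= min_diag


def filter_boxes(boxes: List[Box], dataset: str) -> List[Box]:
    """Drop boxes that are too small for the given dataset's protocol."""
    min_side, min_diag = THRESHOLDS[dataset]
    return [b for b in boxes if keep_box(b, min_side, min_diag)]


if __name__ == "__main__":
    boxes = [Box(0, 0, 8, 40), Box(0, 0, 12, 12), Box(0, 0, 25, 25), Box(0, 0, 50, 50)]
    print(filter_boxes(boxes, "gen1"))
    print(filter_boxes(boxes, "1mpx"))
```

Under these assumptions, the 8 × 40 box fails the side test and the 12 × 12 box fails the diagonal test on Gen1, while the 25 × 25 box passes the Gen1 thresholds but not the stricter 1 Mpx ones.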