Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Aha! - Predicting What Matters Next: Online Highlight Detection Without Looking Ahead
Authors: Aiden Chang, Celso de Melo, Stephanie Lukin
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments This section details the comprehensive experimental evaluation of AHA. We first assess its core performance as an OHD model under strict streaming constraints on two standard HD benchmarks, TVSum and Mr.HiSum (Section 4.1). We then evaluate its robustness to common video degradations and conduct ablation studies to analyze the contributions of its key components (Section 4.2). To demonstrate its practical applicability in challenging real-world conditions, we further test AHA's capabilities on a long-form robotics video (Section 4.3), and generalization potential to other unoptimized video understanding tasks (Section 4.4). Our results are averaged over 5 runs. |
| Researcher Affiliation | Collaboration | Aiden Chang University of Southern California Los Angeles, CA 90089 EMAIL Celso De Melo DEVCOM Army Research Laboratory Adelphi, MD 20783 Stephanie M. Lukin DEVCOM Army Research Laboratory Adelphi, MD 20783 |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the main text or appendices. The methodology is described verbally and with architectural diagrams. |
| Open Source Code | Yes | github.com/aiden200/Aha- The instructions on how to download the dataset will be included in the github. All code and data required to reproduce the main experimental results will be made publicly available upon acceptance, including training scripts, evaluation code, and documentation. |
| Open Datasets | Yes | We construct and release the Human Intuition Highlight Dataset (HIHD), a novel dataset of ~23k videos... AHA surpasses prior methods, including offline approaches, on the HD benchmarks TVSum [14] (+5.9% mAP) and Mr.HiSum [15] (+8.3% mAP). |
| Dataset Splits | Yes | Crucially, HIHD adopts the exact train/validation/test splits from Mr.HiSum to ensure fair comparability, and its training set explicitly excludes videos present in common highlight detection evaluation datasets. |
| Hardware Specification | Yes | Training was performed on 3 compute nodes, each with 2 NVIDIA A6000 GPUs (48GB VRAM), totaling 6 GPUs. The system achieved a sustained throughput of 1 frame per second (FPS), demonstrating high efficiency with 100% peak GPU utilization and 90% peak memory controller utilization. During this process, the framework consumed a peak of 30.49 GB of VRAM across both GPUs and operated well within safe thermal limits at a peak temperature of 65 °C, all while maintaining a minimal system RAM footprint of 3.66 GB. |
| Software Dependencies | Yes | AHA was trained using PyTorch 2.5.1, Transformers 4.49.0, and CUDA 12.4 on Ubuntu 22.04. |
| Experiment Setup | Yes | Table 5: Key hyperparameters for training AHA. Optimization: optimizer AdamW [45]; betas (0.9, 0.999); epsilon 1 × 10⁻⁸; weight decay 0.0; learning rate 2 × 10⁻⁵; LR scheduler cosine decay with linear warmup; warmup ratio 0.05 (0 warmup steps); gradient norm clipping 1.0; gradient checkpointing enabled. Batching: per-device train batch size 1; gradient accumulation steps 2 (effective batch size = 2); num epochs 1. Precision & acceleration: BF16 training enabled; DeepSpeed zero2 [46] + CPU offload; attn implementation FlashAttention2 [47]. Data loading: dataloader workers 4; pin memory True; drop last batch False. Video preprocessing: frame rate 1 fps; frame resolution 384 × 384; pooling stride 4; frame tokens (#) 49; token pooling dims [7, 7]. Model backbones: LLM backbone lmms-lab/llava-onevision-qwen2-7b-ov; vision backbone google/siglip-large-patch16-384; multimodal projector 3 × 3 conv + linear layers. Losses & regularization: stream loss weight 1.0; TV loss window 49. Saving & logging: save strategy steps (every 25 steps); save total limit 5 checkpoints; logging strategy steps (every 1 step). |
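
The Experiment Setup row reports a learning rate of 2 × 10⁻⁵ with cosine decay, linear warmup, and a warmup ratio of 0.05. As a rough illustration of what such a schedule looks like, the sketch below uses one common formulation (linear ramp from zero, then a half-cosine decay to zero). The function name `lr_at_step`, the rounding of the warmup length with `math.ceil`, and the total step count in the usage lines are assumptions made for illustration; the source does not state the total number of training steps or the exact scheduler implementation.

```python
import math

BASE_LR = 2e-5        # learning rate reported in the table
WARMUP_RATIO = 0.05   # warmup ratio reported in the table


def lr_at_step(step, total_steps, base_lr=BASE_LR, warmup_ratio=WARMUP_RATIO):
    """Learning rate at a given optimizer step (hypothetical helper).

    Assumes warmup length = ceil(total_steps * warmup_ratio), a linear ramp
    from 0 to base_lr, then a half-cosine decay from base_lr to 0.
    """
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine decay from base_lr down to 0.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # assumed step count, not taken from the paper
    for s in (0, 25, 50, 525, 1000):
        print(s, lr_at_step(s, total))
```

With the assumed 1000 total steps, warmup covers the first 50 steps, the rate peaks at the base value at step 50, and it falls to half the base value midway through the decay phase.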