Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
StreamForest: Efficient Online Video Understanding with Persistent Event Memory
Authors: Xiangyu Zeng, Kefan Qiu, Qingyu Zhang, Xinhao Li, Jing Wang, Jiaxin Li, Ziang Yan, Kun Tian, Meng Tian, Xinhai Zhao, Yi Wang, Limin Wang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that StreamForest achieves the state-of-the-art performance, with accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. In particular, even under extreme visual token compression (limited to 1024 tokens), the model retains 96.8% of its average accuracy in eight benchmarks relative to the default setting. |
| Researcher Affiliation | Collaboration | 1 Nanjing University; 2 Shanghai AI Laboratory; 3 Zhejiang University; 4 Noah's Ark Lab, Huawei; 5 Yinwang Intelligent Tech. |
| Pseudocode | No | The paper describes its methodology through narrative text and figures (e.g., Figure 2: Overview of our proposed StreamForest) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/MCG-NJU/StreamForest |
| Open Datasets | Yes | Our models, data and code have been released. |
| Dataset Splits | Yes | We also present OnlineIT, a streaming video understanding fine-tuning dataset... As well as ODV-Bench, a new benchmark tailored for real-time autonomous driving scenarios. The model is trained on 32 A100 GPUs using our proposed OnlineIT dataset, supplemented with offline video data from VideoChat-Flash [33] and LLaVA-Video [76], as well as image data from LLaVA-OneVision [28]. |
| Hardware Specification | Yes | The model is trained on 32 A100 GPUs using our proposed OnlineIT dataset, supplemented with offline video data from VideoChat-Flash [33] and LLaVA-Video [76], as well as image data from LLaVA-OneVision [28]. |
| Software Dependencies | No | The paper mentions specific models (SigLIP-so400M as visual encoder, Qwen2-7B as LLM) and an optimizer (AdamW) but does not provide version numbers for software dependencies or libraries such as Python or PyTorch. |
| Experiment Setup | Yes | By default, the number of visual tokens is capped at 8192. Among these, 729 tokens are allocated to real-time perception, while short-term spatiotemporal memory consists of 18 frames, each represented by 128 visual tokens. We set the penalty weights for similarity, merge count, and temporal distance to 0.4, 0.4, and 0.2, respectively. The model is trained on 32 A100 GPUs using our proposed OnlineIT dataset... The full configuration and parameter settings for the online fine-tuning phase are listed in Table 11. |
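
The token allocation quoted in the Experiment Setup row can be checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper's code: the class and function names are hypothetical, the idea that the leftover tokens under the cap are available to other memory is an assumption, and combining the three penalty terms as a weighted sum of values normalized to [0, 1] is also an assumption. Only the numbers (8192, 729, 18, 128, and the weights 0.4, 0.4, 0.2) come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBudget:
    """Hypothetical container for the defaults quoted in the table."""
    total_cap: int = 8192        # cap on visual tokens (from the table)
    realtime_tokens: int = 729   # real-time perception tokens (from the table)
    short_term_frames: int = 18  # short-term memory frames (from the table)
    tokens_per_frame: int = 128  # tokens per short-term frame (from the table)

    @property
    def short_term_tokens(self) -> int:
        # 18 frames x 128 tokens each
        return self.short_term_frames * self.tokens_per_frame

    @property
    def remaining_tokens(self) -> int:
        # Assumption: tokens left under the cap are free for other memory.
        return self.total_cap - self.realtime_tokens - self.short_term_tokens


def merge_penalty(similarity_term: float,
                  merge_count_term: float,
                  temporal_distance_term: float,
                  weights: tuple = (0.4, 0.4, 0.2)) -> float:
    """Assumed weighted sum of three penalty terms, each normalized to [0, 1].

    The weights are the ones reported in the table; the functional form
    is a guess made for illustration.
    """
    w_sim, w_merge, w_time = weights
    return (w_sim * similarity_term
            + w_merge * merge_count_term
            + w_time * temporal_distance_term)


if __name__ == "__main__":
    budget = TokenBudget()
    print("short-term tokens:", budget.short_term_tokens)
    print("tokens left under the cap:", budget.remaining_tokens)
    print("example penalty:", merge_penalty(0.5, 0.0, 1.0))
```

Under these assumptions, real-time perception and short-term memory together take 729 + 18 × 128 = 3033 tokens, leaving 5159 of the 8192-token cap. How the paper actually spends the remainder is specified in the paper itself, not here.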