Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer

Authors: Zhiwei Yang, Chen Gao, Mike Zheng Shou

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate that PANDA achieves state-of-the-art performance in multi-scenario, open-set, and complex scenario settings without training and manual involvement, validating its generalizable and robust anomaly detection capability.
Researcher Affiliation	Academia	Zhiwei Yang1,2 Chen Gao2 Mike Zheng Shou2 1Xidian University 2Show Lab, National University of Singapore
Pseudocode	No	The paper describes the methodology narratively and does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Code is released at https://github.com/showlab/PANDA.
Open Datasets	Yes	We evaluate PANDA on four benchmarks: UCF-Crime [17], XD-Violence [16], UBnormal [34], and CSAD, which represent three distinct settings: multi-scenario (UCF-Crime and XD-Violence), open-set (UBnormal), and complex scenario (CSAD).
Dataset Splits	Yes	UCF-Crime is a large-scale dataset comprising 1,900 long, untrimmed real-world surveillance videos. It covers 13 types of abnormal events such as fighting, abuse, stealing, arson, robbery, and traffic accidents. The training set includes 800 normal and 810 abnormal videos, while the test set consists of 150 normal and 140 abnormal videos. XD-Violence is another large-scale dataset focused on violence detection. It contains 4,754 videos collected from surveillance video, movies, and CCTV sources, encompassing 6 categories of anomaly events. The training and test sets include 3,954 and 800 videos, respectively. [...] CSAD is a complex-scene anomaly detection benchmark constructed in this work. It consists of 100 videos (50 normal and 50 abnormal), sampled from UCF-Crime, XD-Violence, and UBnormal.
Hardware Specification	Yes	We adopt Langgraph [35] to build the whole agent framework and all experiments are implemented using PyTorch [36] on the A6000 GPU.
Software Dependencies	Yes	We adopt Langgraph [35] to build the whole agent framework and all experiments are implemented using PyTorch [36] on the A6000 GPU. We use Qwen2.5VL-7B [28] as the VLM for perception and reasoning stages, and Gemini 2.0 Flash [27] as the MLLM for planning and reflection. During the RAG process, the anomaly knowledge base and environment information are encoded via the all-MiniLM-L6-v2 model [37], with the knowledge base indexed using FAISS for efficient similarity retrieval.
Experiment Setup	Yes	To improve the inference efficiency, the input video is sampled at 1 FPS, and a video clip of s = 5 frames is inferred at each time step. PANDA supports both offline and online reasoning modes. In offline reasoning mode, the perception phase is sampling M = 300 frames uniformly for the whole video, while only the initial M = 10 frames are sampled in online mode. The number of knowledge entries for each type of anomalous event in the anomaly knowledge base is H = 20. The maximum number of reflection rounds r is set to 3. The short CoM length l = 5 during the reasoning stage. We retrieve the top k = 5 anomaly rules from the anomaly knowledge base for each user query.
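
The frame-sampling settings quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not the authors' implementation (that is in the linked repository); the class and function names below are hypothetical, and three details are assumptions the quoted text does not settle: clips are taken as non-overlapping, "uniform" sampling is taken to mean evenly spaced positions, and perception sampling is applied to the 1 FPS frame indices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PandaSetup:
    """Hyperparameter values as quoted in the Experiment Setup row.

    Field names are hypothetical; only the values come from the text.
    """
    sample_fps: float = 1.0               # input video sampled at 1 FPS
    clip_len: int = 5                     # s: frames inferred per time step
    perception_frames_offline: int = 300  # M in offline mode
    perception_frames_online: int = 10    # M in online mode
    kb_entries_per_type: int = 20         # H
    max_reflection_rounds: int = 3        # r
    short_com_len: int = 5                # l
    top_k_rules: int = 5                  # k


def sample_at_fps(num_frames: int, native_fps: float,
                  target_fps: float = 1.0) -> list[int]:
    """Return indices of native frames kept when downsampling to target_fps."""
    # Never step by less than one frame, so no index is repeated.
    step = max(1.0, native_fps / target_fps)
    indices = []
    i = 0
    while int(i * step) < num_frames:
        indices.append(int(i * step))
        i += 1
    return indices


def split_into_clips(indices: list[int], clip_len: int) -> list[list[int]]:
    """Group sampled frames into consecutive clips (assumed non-overlapping)."""
    return [indices[i:i + clip_len] for i in range(0, len(indices), clip_len)]


def perception_indices(indices: list[int], m: int, online: bool) -> list[int]:
    """Pick the frames used for the perception phase.

    Online mode keeps only the first m frames; offline mode takes m evenly
    spaced frames over the whole video (assumed meaning of "uniformly").
    """
    if online:
        return indices[:m]
    if len(indices) <= m:
        return list(indices)
    return [indices[(j * len(indices)) // m] for j in range(m)]


if __name__ == "__main__":
    cfg = PandaSetup()
    # A hypothetical 20-minute video recorded at 30 FPS.
    frames = sample_at_fps(num_frames=36_000, native_fps=30.0,
                           target_fps=cfg.sample_fps)
    clips = split_into_clips(frames, cfg.clip_len)
    offline = perception_indices(frames, cfg.perception_frames_offline, False)
    online = perception_indices(frames, cfg.perception_frames_online, True)
    print(len(frames), len(clips), len(offline), len(online))
```

Under these assumptions, the offline mode needs the full video up front, whereas the online mode only needs its first M = 10 sampled frames before reasoning can begin.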