VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection

Authors: Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, Yanning Zhang

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD, surpassing the state-of-the-art methods by a large margin. Specifically, VadCLIP achieves 84.51% AP and 88.02% AUC on XD-Violence and UCF-Crime, respectively. Code and features are released at https://github.com/nwpu-zxr/VadCLIP. Extensive ablations are carried out on XD-Violence dataset.
Researcher Affiliation | Academia | Peng Wu1, Xuerong Zhou1, Guansong Pang2*, Lingru Zhou1, Qingsen Yan1, Peng Wang1, Yanning Zhang1; 1ASGO, School of Computer Science, Northwestern Polytechnical University, China; 2School of Computing and Information Systems, Singapore Management University, Singapore. {xdwupeng, zxr2333}@gmail.com, gspang@smu.edu.sg, {lingruzhou, yqs}@mail.nwpu.edu.cn, {peng.wang, ynzhang}@nwpu.edu.cn
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Code and features are released at https://github.com/nwpu-zxr/VadCLIP.
Open Datasets | Yes | We conduct experiments on two popular WSVAD datasets, i.e., UCF-Crime and XD-Violence. Notably, training videos only have video-level labels on both datasets.
Dataset Splits | No | The paper mentions "training videos only have video-level labels" and later "on the abnormal videos in the test set", but does not provide specific details on training, validation, and test splits (e.g., percentages, counts, or explicit predefined splits).
Hardware Specification | Yes | VadCLIP is trained on a single NVIDIA RTX 3090 GPU using PyTorch.
Software Dependencies | No | The paper mentions "PyTorch" but does not specify its version, nor versions for any other software libraries or solvers.
Experiment Setup | Yes | For hyper-parameters, we set σ in Eq. 3 as 1, τ in Eq. 8 as 0.07, and the context length l as 20. For window length in LGT-Adapter, we set it as 64 and 8 on XD-Violence and UCF-Crime, respectively. For λ in Eq. 10, we set it as 1×10⁻⁴ and 1×10⁻¹ on XD-Violence and UCF-Crime, respectively. For model training, VadCLIP is trained on a single NVIDIA RTX 3090 GPU using PyTorch. We use AdamW as the optimizer with a batch size of 64. On XD-Violence, the learning rate and total epochs are set as 2×10⁻⁵ and 20, respectively, and on UCF-Crime, the learning rate and total epochs are set as 1×10⁻⁵ and 10, respectively.
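
To make the quoted setup easier to scan, the sketch below collects the reported training configuration in PyTorch-style Python. Only the hyperparameter values (σ, τ, context length, window length, λ, learning rates, epochs, batch size, AdamW optimizer) come from the paper; the model class, dataloader, and loss interface are hypothetical placeholders and not the authors' released code.

```python
import torch

# Optimization settings quoted in the paper's Experiment Setup.
# Model, dataloader, and loss API below are hypothetical placeholders.
HPARAMS = {
    "xd-violence": {"window_len": 64, "lambda": 1e-4, "lr": 2e-5, "epochs": 20},
    "ucf-crime":   {"window_len": 8,  "lambda": 1e-1, "lr": 1e-5, "epochs": 10},
}
SIGMA = 1.0          # sigma in Eq. 3
TAU = 0.07           # temperature tau in Eq. 8
CONTEXT_LENGTH = 20  # learnable prompt context length
BATCH_SIZE = 64      # batch size for the (placeholder) dataloader


def train(dataset: str, model: torch.nn.Module, dataloader) -> None:
    """Weakly supervised training loop using only video-level labels."""
    cfg = HPARAMS[dataset]
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg["lr"])
    model.train()
    for _ in range(cfg["epochs"]):
        for clip_features, video_labels in dataloader:
            # Hypothetical forward pass: the model is assumed to return the
            # total training loss (Eq. 10), with the lambda-weighted term
            # applied internally.
            loss = model(clip_features, video_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

This sketch only mirrors the optimization settings reported in the table; the actual architecture, loss terms, and data pipeline are defined in the authors' repository linked above.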