HAWK: Learning to Understand Open-World Video Anomalies

Authors: Jiaqi Tang, Hao Lu, Ruizheng Wu, Xiaogang Xu, Ke Ma, Cheng Fang, Bin Guo, Jiangbo Lu, Qifeng Chen, Yingcong Chen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The final results demonstrate that HAWK achieves SOTA performance, surpassing existing baselines in both video description generation and question-answering.
Researcher Affiliation | Collaboration | The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; HKUST(GZ)-SmartMore Joint Lab; SmartMore Corporation; The Chinese University of Hong Kong; Zhejiang University; Northwestern Polytechnical University
Pseudocode | No | The paper describes the methodology using text and diagrams (Figures 3 and 4) but does not provide structured pseudocode or an algorithm block.
Open Source Code | Yes | Our codes/dataset/demo will be released at https://github.com/jqtangust/hawk.
Open Datasets | Yes | Stage 1 involves pre-training on the WebVid dataset [3] to acquire a general understanding of video content. In Stage 2, we fine-tune the model's focus towards video anomaly understanding by employing a specially curated dataset described in Section 1, consisting of over 8,000 videos. These seven datasets include a variety of anomalous scenarios such as crime (UCF-Crime [33]), campus (ShanghaiTech [22] and CUHK Avenue [23]), pedestrian walkways (UCSD Ped1 [6] and Ped2 [37]), traffic (DoTA [45]), and human behavior (UBnormal [2]). Our codes/dataset/demo will be released at https://github.com/jqtangust/hawk. (A sketch of this scenario-to-dataset mapping appears after the table.)
Dataset Splits | No | We use 90% of these videos for training and allocate the remaining 10% for testing purposes. We jointly train on two tasks: video <DESCRIPTION> generation and video <QUESTION> <ANSWERING>. (A sketch of the 90/10 split appears after the table.)
Hardware Specification | Yes | During the pre-training phase, we utilized four Nvidia RTX A6000 GPUs to train on the WebVid dataset [3] for approximately 120 hours. In the fine-tuning phase, we employed two Nvidia RTX A6000 GPUs to fine-tune on our proposed dataset for about 80 hours.
Software Dependencies | No | The paper mentions several software components, including BLIP-2 [17], LLaMA-2 [35], GPT-4 [1], Gunnar Farneback's algorithm [11, 12], InternVideo [38], Tag2Text [14], GRiT [39], FlowNetS [7], and FlowNetC [7]. However, it does not provide specific version numbers for these software libraries or tools. (A sketch for recording environment versions appears after the table.)
Experiment Setup | Yes | In the loss function, t0 is set to 1 for our main task (video-to-language), while t1 and t2 are set to 0.1 for the two auxiliary tasks, balancing the different loss values. (A sketch of this weighted objective appears after the table.)
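
For quick reference, here is a hedged sketch of the scenario-to-dataset mapping listed in the Open Datasets row; the dictionary layout and variable names are illustrative, not the authors' actual data format.

```python
# Scenario -> source datasets, as listed in the Open Datasets row above.
ANOMALY_SOURCES = {
    "crime": ["UCF-Crime"],
    "campus": ["ShanghaiTech", "CUHK Avenue"],
    "pedestrian walkways": ["UCSD Ped1", "UCSD Ped2"],
    "traffic": ["DoTA"],
    "human behavior": ["UBnormal"],
}

all_datasets = [name for sources in ANOMALY_SOURCES.values() for name in sources]
print(len(all_datasets))  # 7 datasets, jointly covering the 8,000+ curated videos
```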
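
The dataset split is described only as 90% training / 10% testing at the video level. A minimal sketch follows, assuming a flat list of video paths and a fixed shuffle seed (both assumptions; the paper does not state how the split is drawn).

```python
import random

def split_videos(video_paths, train_ratio=0.9, seed=0):
    """Shuffle once, then split the video list into train and test subsets."""
    rng = random.Random(seed)
    paths = list(video_paths)
    rng.shuffle(paths)
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]

# Example with placeholder file names for the ~8,000 curated videos.
train, test = split_videos([f"video_{i:04d}.mp4" for i in range(8000)])
print(len(train), len(test))  # 7200 800
```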
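
Because no library versions are given, anyone reproducing the work would have to record their own environment. A minimal sketch using Python's importlib.metadata; the package names below are examples, not a verified list of HAWK's dependencies.

```python
from importlib import metadata

# Example package names only; the paper does not specify its dependency versions.
packages = ["torch", "transformers", "opencv-python"]

for name in packages:
    try:
        print(f"{name}=={metadata.version(name)}")
    except metadata.PackageNotFoundError:
        print(f"{name}: not installed")
```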
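
Finally, a minimal sketch of the weighted multi-task objective from the Experiment Setup row, assuming each task exposes a scalar loss tensor; the function and variable names are illustrative, not taken from the released code.

```python
import torch

# Weights as reported: t0 = 1 for the main video-to-language task,
# t1 = t2 = 0.1 for the two auxiliary tasks.
T0, T1, T2 = 1.0, 0.1, 0.1

def total_loss(loss_v2l: torch.Tensor,
               loss_aux1: torch.Tensor,
               loss_aux2: torch.Tensor) -> torch.Tensor:
    """Combine the main and auxiliary losses with the reported weights."""
    return T0 * loss_v2l + T1 * loss_aux1 + T2 * loss_aux2

# Example with placeholder scalar losses:
loss = total_loss(torch.tensor(2.3), torch.tensor(0.8), torch.tensor(1.1))
print(loss)  # tensor(2.4900)
```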