Video Event Extraction via Tracking Visual States of Arguments

Authors: Guang Yang, Manling Li, Jiajie Zhang, Xudong Lin, Heng Ji, Shih-Fu Chang

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on various video event extraction tasks demonstrate significant improvements compared to state-of-the-art models.
Researcher Affiliation | Academia | Tsinghua University; University of Illinois at Urbana-Champaign; Columbia University
Pseudocode | No | The paper describes its methods in prose and uses figures to illustrate concepts, but it does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is publicly available at https://github.com/Shinetism/VStates for research purposes.
Open Datasets | Yes | We evaluate our model on VidSitu (Sadhu et al. 2021), a public video dataset providing extensive verb and argument-structure annotations for more than 130k video clips.
Dataset Splits | Yes | Table 1 (data statistics), Train / Valid / Test-Verb / Test-Role: # Clip 118,130 / 6,630 / 6,765 / 7,990; # Verb 118,130 / 66,300 / 67,650 / 79,900; # Role 118,130 / 19,890 / 20,295 / 23,970.
Hardware Specification | Yes | The training took about 20 hours on 4 V100 GPUs, comparable to the original SlowFast baseline.
Software Dependencies | No | The paper mentions using an Adam optimizer and setting hyperparameters such as the learning rate and batch size, but it does not specify software dependencies such as Python, PyTorch/TensorFlow, or CUDA versions.
Experiment Setup | Yes | For the verb classification task, we trained our model for 10 epochs and report the model with the highest validation F1@5 score. ... We keep d_c = 128 in our experiments. We set the maximum number of objects to 8 and rank objects by detection confidence. The learning rate is chosen from {10^-4, 3 × 10^-5} and the batch size for training is set to 8. We use the Adam optimizer with β1 = 0.9, β2 = 0.99, and ε = 10^-8, and no learning rate scheduler is applied to our training. (See the configuration sketch below.)
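
As a rough illustration of the experiment setup reported above, the sketch below wires the quoted hyperparameters (Adam with β1 = 0.9, β2 = 0.99, ε = 10^-8, batch size 8, 10 epochs, learning rate from {10^-4, 3 × 10^-5}, no scheduler) into a minimal PyTorch training loop. The model and data here are placeholders for illustration only; this is not the released VStates implementation.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hyperparameters quoted in the paper's experiment setup.
LEARNING_RATE = 3e-5        # chosen from {1e-4, 3e-5}
BATCH_SIZE = 8
NUM_EPOCHS = 10             # verb classification task
ADAM_BETAS = (0.9, 0.99)
ADAM_EPS = 1e-8
HIDDEN_DIM = 128            # d_c = 128 in the paper

# Placeholder classifier standing in for the paper's architecture.
model = nn.Sequential(
    nn.Linear(HIDDEN_DIM, HIDDEN_DIM),
    nn.ReLU(),
    nn.Linear(HIDDEN_DIM, 10),  # 10 dummy verb classes
)

# Dummy features/labels purely to make the sketch runnable.
features = torch.randn(64, HIDDEN_DIM)
labels = torch.randint(0, 10, (64,))
loader = DataLoader(TensorDataset(features, labels),
                    batch_size=BATCH_SIZE, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE,
                             betas=ADAM_BETAS, eps=ADAM_EPS)
criterion = nn.CrossEntropyLoss()

# No learning-rate scheduler is used, matching the reported setup.
for epoch in range(NUM_EPOCHS):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```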