Video Event Extraction via Tracking Visual States of Arguments
Authors: Guang Yang, Manling Li, Jiajie Zhang, Xudong Lin, Heng Ji, Shih-Fu Chang
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on various video event extraction tasks demonstrate significant improvements compared to state-of-the-art models. |
| Researcher Affiliation | Academia | ¹Tsinghua University, ²University of Illinois at Urbana-Champaign, ³Columbia University |
| Pseudocode | No | The paper describes its methods in prose and uses figures to illustrate concepts, but it does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our Code is publicly available at https://github.com/Shinetism/VStates for research purposes. |
| Open Datasets | Yes | We evaluate our model on VidSitu (Sadhu et al. 2021), the public video dataset providing extensive verb and argument structure annotations for more than 130k video clips. |
| Dataset Splits | Yes | Table 1 reports the data statistics per split (Train / Valid / Test-Verb / Test-Role): # Clip 118,130 / 6,630 / 6,765 / 7,990; # Verb 118,130 / 66,300 / 67,650 / 79,900; # Role 118,130 / 19,890 / 20,295 / 23,970. |
| Hardware Specification | Yes | The training took about 20 hours on 4 V100 GPUs, comparable to the original SlowFast baseline. |
| Software Dependencies | No | The paper mentions using an Adam optimizer and setting hyperparameters like learning rate and batch size, but it does not specify software dependencies such as Python, PyTorch/TensorFlow, or CUDA versions. |
| Experiment Setup | Yes | For the verb classification task, we trained our model for 10 epochs and report the model with the highest validation F1@5 score. ... We keep d_c = 128 in our experiments. We set the maximum number of objects to 8 and rank objects by detection confidence. The learning rate is chosen from {10⁻⁴, 3 × 10⁻⁵} and the training batch size is set to 8. We use the Adam optimizer with β1 = 0.9, β2 = 0.99, and ε = 10⁻⁸; no learning-rate scheduler is applied. (A hedged configuration sketch follows the table.) |
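
The Experiment Setup row pins down the optimizer and key hyperparameters. Below is a minimal training-loop sketch of that configuration, assuming PyTorch (the paper does not state its framework; see the repository at https://github.com/Shinetism/VStates for the authors' actual code). `VStatesModel`, `VidSituDataset`, and `evaluate_f1_at_5` are hypothetical placeholders, not names from the paper.

```python
# Hedged sketch of the reported training setup. PyTorch is an assumption;
# VStatesModel, VidSituDataset, and evaluate_f1_at_5 are hypothetical.
import torch
from torch.utils.data import DataLoader

MAX_OBJECTS = 8        # objects kept per clip, ranked by detection confidence
D_C = 128              # d_c, the embedding dimension kept in the experiments
LEARNING_RATE = 3e-5   # chosen from {1e-4, 3e-5} per the paper
BATCH_SIZE = 8
NUM_EPOCHS = 10        # verb classification; best validation F1@5 is kept

model = VStatesModel(d_c=D_C, max_objects=MAX_OBJECTS)   # hypothetical class
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=LEARNING_RATE,
    betas=(0.9, 0.99),
    eps=1e-8,
)
# No learning-rate scheduler is applied, per the paper.

train_loader = DataLoader(
    VidSituDataset("train"), batch_size=BATCH_SIZE, shuffle=True
)

best_f1 = 0.0
for epoch in range(NUM_EPOCHS):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(batch)              # assumed to return the training loss
        loss.backward()
        optimizer.step()
    f1_at_5 = evaluate_f1_at_5(model)    # hypothetical validation helper
    if f1_at_5 > best_f1:                # keep the checkpoint with best F1@5
        best_f1 = f1_at_5
        torch.save(model.state_dict(), "best_verb_model.pt")
```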