Video Event Extraction with Multi-View Interaction Knowledge Distillation
Authors: Kaiwen Wei, Runyan Du, Li Jin, Jian Liu, Jianhua Yin, Linhao Zhang, Jintao Liu, Nayu Liu, Jingyuan Zhang, Zhi Guo
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that without any additional parameters, MID achieves state-of-the-art performance compared to other strong methods in VEE. We conduct extensive experiments on the large-scale VidSitu (Sadhu et al. 2021) dataset, and the experimental results have justified the effectiveness of the proposed MID. |
| Researcher Affiliation | Collaboration | (1) College of Computer Science, Chongqing University, Chongqing, China; (2) University of Chinese Academy of Sciences, Beijing, China; (3) Beijing Jiaotong University, Beijing, China; (4) School of Computer Science and Technology, Shandong University, Qingdao, China; (5) School of Computer Science and Technology, Tiangong University, Tianjin, China; (6) Kuaishou Technology Inc., Beijing, China |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks with explicit labels like "Pseudocode" or "Algorithm." |
| Open Source Code | No | The relevant code will be released to facilitate research in the related area. |
| Open Datasets | Yes | We conduct extensive experiments on the large-scale VidSitu (Sadhu et al. 2021) dataset, a video understanding dataset with over 130,000 video clips. The specific dataset statistics are illustrated in Table 1. The results on the test set are hidden and displayed on the leaderboard (https://leaderboard.allenai.org/vidsitu-verbs/submissions/public). |
| Dataset Splits | Yes | The specific dataset statistics are illustrated in Table 1. Train / Valid / Test-Verb / Test-Role: Clips 118,130 / 6,630 / 6,765 / 7,990; Verbs 118,130 / 66,300 / 67,650 / 79,900; Roles 118,130 / 19,890 / 20,295 / 23,970 |
| Hardware Specification | Yes | All the experiments are conducted on 4 V100 GPUs. |
| Software Dependencies | No | The paper mentions software components such as the Adam optimizer and a transformer decoder, but does not specify library or framework versions (e.g., PyTorch 1.x, TensorFlow 2.x, Python 3.x). |
| Experiment Setup | Yes | The batch size is set as 16. We leverage the Adam optimizer with a 1e-4 learning rate. The maximum number of objects in each frame is 8. When distilling between different blocks, the α and β in both losses are all set as 1.0. All the experiments are conducted on 4 V100 GPUs. The models are trained for 10 epochs and reported with the highest validation F1@5 score. For the semantic role prediction task, the visual event embedding for each video clip remains fixed, and we only train the sequence-to-sequence model. The number of transformer decoder layers is 3. We train the model for 10 epochs and report performance at the highest validation CIDEr. The optimal hyper-parameters are obtained by grid search. (A hedged configuration sketch based on these values follows the table.) |
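
The reported hyper-parameters map onto a fairly standard PyTorch training configuration. The sketch below is a minimal illustration of that setup, assuming PyTorch; it is not the authors' code. The decoder dimensions (`d_model=768`, `nhead=8`), the `combined_loss` helper, and the feature-matching form of the distillation term are all hypothetical placeholders, since the paper's MID architecture and exact loss definitions are not reproduced in this report. Only the numeric constants come from the paper.

```python
import torch
from torch import nn, optim
import torch.nn.functional as F

# Values reported in the paper's experiment setup.
BATCH_SIZE = 16          # batch size
LEARNING_RATE = 1e-4     # Adam learning rate
NUM_EPOCHS = 10          # training epochs for both tasks
NUM_DECODER_LAYERS = 3   # transformer decoder layers (role prediction)
MAX_OBJECTS = 8          # maximum number of objects per frame
ALPHA, BETA = 1.0, 1.0   # loss weights when distilling between blocks

# Hypothetical sequence-to-sequence decoder for semantic role prediction;
# per the paper, the visual event embedding stays fixed and only this
# module is trained. d_model and nhead are assumed, not reported.
decoder_layer = nn.TransformerDecoderLayer(d_model=768, nhead=8,
                                           batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer,
                                num_layers=NUM_DECODER_LAYERS)
optimizer = optim.Adam(decoder.parameters(), lr=LEARNING_RATE)

def combined_loss(task_logits, labels, student_feat, teacher_feat):
    """Illustrative weighted sum of a task loss and a feature-matching
    distillation term. The paper sets alpha and beta to 1.0; the actual
    MID loss terms are not shown here."""
    task_loss = F.cross_entropy(task_logits, labels)
    distill_loss = F.mse_loss(student_feat, teacher_feat.detach())
    return ALPHA * task_loss + BETA * distill_loss
```

The paper states only that experiments ran on 4 V100 GPUs; it does not say how training was parallelized, so a multi-GPU wrapper (e.g., `torch.nn.parallel.DistributedDataParallel`) would be an additional assumption on top of this sketch.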