Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection
Authors: Shuo Yang, Yongqi Wang, Xiaofeng Ji, Xinxiao Wu
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our method, especially achieving a significant gain of nearly 10% in mAP on novel relationship categories on the VidVRD dataset. |
| Researcher Affiliation | Academia | 1Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University, China 2Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Methods are described in prose and through diagrams. |
| Open Source Code | Yes | Codes are at https://github.com/wangyongqi558/MMP_OV_VidVRD |
| Open Datasets | Yes | We evaluate our method on the VidVRD (Shang et al. 2017) and VidOR (Shang et al. 2019) datasets. |
| Dataset Splits | Yes | The VidVRD dataset contains 1000 videos, 800 videos for training and 200 for testing... The VidOR dataset contains 10000 videos, 7000 videos for training, 835 videos for validation, and 2165 videos for testing |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions using CLIP and the AdamW algorithm but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | For all experiments, video frames are sampled every 30 frames. We adopt the ViT-B/16 version of CLIP while keeping the parameters fixed. The number of Transformer blocks of spatio-temporal visual prompting is set to 1 and 2 for the VidVRD dataset and the VidOR dataset, respectively. The head number of multi-head self-attention of Transformer blocks is set to 8, and the dropout rate is set to 0.1. For language prompting, we set the number of tokens for both learnable continuous prompts and conditional prompts to 8. The [CLS] token is positioned at 75% of the token length. For optimization, we use the AdamW (Loshchilov and Hutter 2019) algorithm with an initial learning rate of 0.001. A multi-step decay schedule is applied at epochs 15, 20, and 25, reducing the learning rate by a factor of 0.1 each time. The batch size is set to 32. |
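
The experiment-setup row above lists concrete optimization hyperparameters (AdamW with an initial learning rate of 0.001, multi-step decay by 0.1 at epochs 15, 20, and 25, batch size 32, dropout 0.1, frozen CLIP ViT-B/16). The snippet below is a minimal PyTorch sketch of that optimization schedule only; the `model`, loss, and data loader are hypothetical placeholders and do not correspond to the authors' released code, while the numeric hyperparameters are taken from the paper's stated setup.

```python
# Hedged sketch: reproduces only the reported optimization hyperparameters
# (AdamW, lr 0.001, MultiStepLR at epochs 15/20/25 with gamma 0.1, batch size 32).
# The model and data below are placeholders, not the authors' prompting modules.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import MultiStepLR
from torch.utils.data import DataLoader, TensorDataset

# Placeholder trainable module standing in for the prompting components;
# the CLIP ViT-B/16 backbone itself is kept frozen per the paper.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.Dropout(0.1),  # dropout rate reported in the paper
)

optimizer = AdamW(model.parameters(), lr=1e-3)                  # initial lr 0.001
scheduler = MultiStepLR(optimizer, milestones=[15, 20, 25], gamma=0.1)

# Dummy data with the reported batch size of 32.
dummy = TensorDataset(torch.randn(256, 512), torch.randn(256, 512))
loader = DataLoader(dummy, batch_size=32, shuffle=True)

for epoch in range(30):
    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)  # placeholder loss
        loss.backward()
        optimizer.step()
    scheduler.step()  # learning rate drops by 0.1 at epochs 15, 20, and 25
```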