Open-Vocabulary Video Relation Extraction
Authors: Wentao Tian, Zheng Wang, Yuqian Fu, Jingjing Chen, Lechao Cheng
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present our results on Moments-OVRE in Table 2 and compare our approach with baseline methods trained under the same training settings. Our approach outperforms baseline generative methods, achieving a higher METEOR score than ClipCap (+6.22) and GIT (+2.01). We find that although GIT was pre-trained on 0.8B image-text pairs and achieved impressive performance on video captioning datasets, it did not perform as well as our approach on the OVRE task. |
| Researcher Affiliation | Academia | Wentao Tian¹, Zheng Wang², Yuqian Fu¹, Jingjing Chen¹, Lechao Cheng³. ¹Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; ²College of Computer Science and Technology, Zhejiang University of Technology; ³Zhejiang Lab |
| Pseudocode | No | The paper describes the overall framework and its components (Video Encoder, Attentional Pooler, Text Decoder) but does not include structured pseudocode or an algorithm block. A hedged sketch of how such a pipeline might be wired appears after this table. |
| Open Source Code | Yes | Our code and dataset are available at https://github.com/Iriya99/OVRE. |
| Open Datasets | Yes | Moreover, we curate the Moments-OVRE dataset, which comprises 180K videos with action-centric relation triplets, sourced from a multi-label action classification dataset. With Moments-OVRE, we further propose a cross-modal mapping model to generate relation triplets as a sequence. Our code and dataset are available at https://github.com/Iriya99/OVRE. |
| Dataset Splits | No | The data is partitioned into training and testing sets, resulting in 178,480 and 8,463 videos respectively. The paper does not explicitly state a separate validation set split or its proportions. |
| Hardware Specification | Yes | We trained the networks for 50 epochs on 8 Nvidia V100 GPUs. |
| Software Dependencies | No | The paper mentions software such as CLIP, GPT-2, SimCSE, and the AdamW optimizer, but does not provide specific version numbers for these or for programming languages and libraries (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | We train the generation model using cross-entropy loss and employ teacher forcing to accelerate the training process. All models are optimized using the AdamW optimizer, with β1 = 0.9, β2 = 0.999, a batch size of 16, and weight decay of 1e-3. The initial learning rate is set to 1e-6 for CLIP, 2e-5 for GPT-2, and 1e-3 for the Attention Pooler. We applied learning rate warm-up during the first 5% of training steps, followed by cosine decay. (A configuration sketch follows the table.) |
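
The Pseudocode row notes that the paper names its components (a CLIP video encoder, an Attentional Pooler, a GPT-2 text decoder) without giving pseudocode. Below is a minimal PyTorch sketch of how such a pipeline could be wired; the pooler design, the feature dimensions, the query count, and the projection layer are illustrative assumptions, not the authors' published implementation.

```python
import torch
import torch.nn as nn

class AttentionalPooler(nn.Module):
    """Pools a variable number of frame features into a fixed set of query
    tokens via cross-attention (illustrative sketch, not the authors' exact design)."""
    def __init__(self, dim=512, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, frame_feats):                 # frame_feats: (B, T, dim)
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, frame_feats, frame_feats)
        return pooled                               # (B, n_queries, dim)

class OVREModel(nn.Module):
    """CLIP video encoder -> Attentional Pooler -> GPT-2 decoder, generating
    relation triplets as a token sequence (assumed wiring)."""
    def __init__(self, clip_encoder, gpt2, dim=512, gpt2_dim=768):
        super().__init__()
        self.encoder = clip_encoder                 # yields per-frame CLIP features
        self.pooler = AttentionalPooler(dim=dim)
        self.proj = nn.Linear(dim, gpt2_dim)        # map into GPT-2's embedding space
        self.decoder = gpt2                         # e.g., a GPT2LMHeadModel

    def forward(self, frames, triplet_ids):
        feats = self.encoder(frames)                # (B, T, dim)
        prefix = self.proj(self.pooler(feats))      # (B, n_queries, gpt2_dim)
        tok_emb = self.decoder.transformer.wte(triplet_ids)
        inputs = torch.cat([prefix, tok_emb], dim=1)
        return self.decoder(inputs_embeds=inputs).logits
```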
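
The Experiment Setup row can likewise be read as a concrete training configuration. The sketch below encodes the reported hyperparameters (AdamW with β1 = 0.9, β2 = 0.999, weight decay 1e-3, per-module learning rates, 5% linear warm-up with cosine decay, teacher-forced cross-entropy); the scheduler implementation and the steps-per-epoch estimate are assumptions, and `model`, `frames`, and `triplet_ids` refer to the sketch above. Grouping the projection layer with the pooler's learning rate is also an assumption.

```python
import math
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Reported learning rates: 1e-6 (CLIP), 2e-5 (GPT-2), 1e-3 (Attention Pooler).
optimizer = AdamW(
    [
        {"params": model.encoder.parameters(), "lr": 1e-6},
        {"params": model.decoder.parameters(), "lr": 2e-5},
        {"params": model.pooler.parameters(), "lr": 1e-3},
        {"params": model.proj.parameters(), "lr": 1e-3},  # assumed grouped with the pooler
    ],
    betas=(0.9, 0.999),
    weight_decay=1e-3,
)

steps_per_epoch = 178_480 // 16               # train videos / batch size (estimate)
total_steps = 50 * steps_per_epoch            # 50 epochs
warmup_steps = int(0.05 * total_steps)        # warm-up over the first 5% of steps

def lr_lambda(step):
    # Linear warm-up, then cosine decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

# One teacher-forced step: ground-truth tokens go in, and the logits at the
# positions that predict each triplet token are scored with cross-entropy.
logits = model(frames, triplet_ids)           # (B, n_queries + L, vocab)
L = triplet_ids.size(1)
pred = logits[:, -L - 1:-1]                   # positions predicting tokens 0..L-1
loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)), triplet_ids.reshape(-1))
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```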