Open-Vocabulary Video Relation Extraction

Authors: Wentao Tian, Zheng Wang, Yuqian Fu, Jingjing Chen, Lechao Cheng

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present our results on Moments-OVRE in Table 2 and compare our approach with baseline methods trained under the same training settings. Our approach outperforms baseline generative methods, achieving a higher METEOR score (+6.22) than ClipCap and (+2.01) than GIT. We find that although GIT was pre-trained on 0.8B image-text pairs and achieved impressive performance on video captioning datasets, it did not perform as well as our approach on the OVRE task.
Researcher Affiliation | Academia | Wentao Tian (1), Zheng Wang (2), Yuqian Fu (1), Jingjing Chen (1), Lechao Cheng (3); (1) Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; (2) College of Computer Science and Technology, Zhejiang University of Technology; (3) Zhejiang Lab
Pseudocode | No | The paper describes the overall framework and its components (Video Encoder, Attentional Pooler, Text Decoder) but does not include structured pseudocode or an algorithm block. (A hedged architecture sketch follows the table.)
Open Source Code | Yes | Our code and dataset are available at https://github.com/Iriya99/OVRE.
Open Datasets | Yes | Moreover, we curate the Moments-OVRE dataset, which comprises 180K videos with action-centric relation triplets, sourced from a multi-label action classification dataset. With Moments-OVRE, we further propose a cross-modal mapping model to generate relation triplets as a sequence. Our code and dataset are available at https://github.com/Iriya99/OVRE. (A triplet-serialization sketch follows the table.)
Dataset Splits | No | The data is partitioned into training and testing sets, resulting in 178,480 and 8,463 videos respectively. The paper does not explicitly state a separate validation set split or its proportions.
Hardware Specification | Yes | We trained the networks for 50 epochs on 8 Nvidia V100 GPUs.
Software Dependencies | No | The paper mentions software such as CLIP, GPT-2, SimCSE, and the AdamW optimizer, but does not provide specific version numbers for these or for programming languages and libraries (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | We train the generation model using cross-entropy loss and employ teacher forcing to accelerate the training process. All models are optimized using the AdamW optimizer, with β1 = 0.9, β2 = 0.999, a batch size of 16, and weight decay of 1e-3. The initial learning rate is set to 1e-6 for CLIP, 2e-5 for GPT-2, and 1e-3 for the Attentional Pooler. We applied learning rate warm-up during the early 5% of training steps, followed by cosine decay. (An optimizer and scheduler sketch follows the table.)
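
The Pseudocode row notes that the paper describes a Video Encoder, Attentional Pooler, and Text Decoder but provides no algorithm block. The following is a minimal PyTorch-style sketch of such a pipeline, assuming Hugging Face CLIPVisionModel and GPT2LMHeadModel interfaces; the pooler is a generic learned-query cross-attention module, and the checkpoint names, dimensions, and prefix-conditioning details are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: CLIP video encoder -> attentional pooler -> GPT-2 decoder that
# emits relation triplets as a sequence. Names and sizes are assumptions.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, GPT2LMHeadModel

class AttentionalPooler(nn.Module):
    """Cross-attention from a fixed set of learned queries to frame features."""
    def __init__(self, dim=768, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats):                       # (B, T*N, dim)
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, frame_feats, frame_feats)
        return pooled                                      # (B, num_queries, dim)

class OVREModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Checkpoints are illustrative; the paper only names CLIP and GPT-2.
        self.video_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.pooler = AttentionalPooler(dim=self.video_encoder.config.hidden_size)
        self.decoder = GPT2LMHeadModel.from_pretrained("gpt2")
        self.proj = nn.Linear(self.video_encoder.config.hidden_size,
                              self.decoder.config.n_embd)

    def forward(self, frames, labels):
        # frames: (B, T, 3, H, W) sampled video frames; labels: tokenized triplet sequence.
        B, T = frames.shape[:2]
        feats = self.video_encoder(pixel_values=frames.flatten(0, 1)).last_hidden_state
        feats = feats.view(B, T * feats.size(1), -1)       # concatenate per-frame tokens
        prefix = self.proj(self.pooler(feats))              # visual prefix for GPT-2
        tok_emb = self.decoder.transformer.wte(labels)
        inputs = torch.cat([prefix, tok_emb], dim=1)
        # Cross-entropy on the text portion only; prefix positions are ignored (-100).
        pad = torch.full((B, prefix.size(1)), -100, dtype=labels.dtype, device=labels.device)
        return self.decoder(inputs_embeds=inputs, labels=torch.cat([pad, labels], dim=1)).loss
```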
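
The Open Datasets row states that the model generates relation triplets as a sequence. The helper below sketches one plausible way to serialize triplets into a decoder target and to parse a generated sequence back into triplets; the delimiter choices are assumptions for illustration, not the documented Moments-OVRE format.

```python
# Hedged sketch of triplets-as-a-sequence serialization; delimiters are assumed.
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (subject, predicate, object)

def triplets_to_sequence(triplets: List[Triplet]) -> str:
    """Join triplets into one caption-like target string for the text decoder."""
    return "; ".join(f"{s}, {p}, {o}" for s, p, o in triplets)

def sequence_to_triplets(text: str) -> List[Triplet]:
    """Recover triplets from a generated sequence, skipping malformed chunks."""
    out = []
    for chunk in text.split(";"):
        parts = [p.strip() for p in chunk.split(",")]
        if len(parts) == 3 and all(parts):
            out.append((parts[0], parts[1], parts[2]))
    return out

# Example (hypothetical content):
# triplets_to_sequence([("person", "opens", "door"), ("person", "walks through", "doorway")])
# -> "person, opens, door; person, walks through, doorway"
```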
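
The Experiment Setup row reports AdamW with β1 = 0.9, β2 = 0.999, weight decay 1e-3, per-module learning rates (1e-6 for CLIP, 2e-5 for GPT-2, 1e-3 for the Attentional Pooler), and 5% warm-up followed by cosine decay. The sketch below wires those reported values together, reusing the module names from the architecture sketch above; the transformers scheduler helper is an assumed convenience, not necessarily what the authors used.

```python
# Hedged sketch of the reported optimization setup: AdamW, per-module learning
# rates, 5% linear warm-up, then cosine decay.
import torch
from transformers import get_cosine_schedule_with_warmup

def build_optimizer_and_scheduler(model, total_steps):
    param_groups = [
        {"params": model.video_encoder.parameters(), "lr": 1e-6},  # CLIP
        {"params": model.decoder.parameters(),       "lr": 2e-5},  # GPT-2
        {"params": list(model.pooler.parameters())
                   + list(model.proj.parameters()),  "lr": 1e-3},  # Attentional Pooler
    ]
    optimizer = torch.optim.AdamW(param_groups, betas=(0.9, 0.999), weight_decay=1e-3)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.05 * total_steps),   # warm-up over the early 5% of steps
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```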