Exploiting Auxiliary Caption for Video Grounding

Authors: Hongxiang Li, Meng Cao, Xuxin Cheng, Yaowei Li, Zhihong Zhu, Yuexian Zou

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on three public datasets (i.e., ActivityNet Captions, TACoS and ActivityNet-CG) demonstrate that our method significantly outperforms state-of-the-art methods.
Researcher Affiliation | Academia | Hongxiang Li1, Meng Cao2, Xuxin Cheng1, Yaowei Li1, Zhihong Zhu1, Yuexian Zou1. 1School of Electronic and Computer Engineering, Peking University; 2International Digital Economy Academy (IDEA). {lihongxiang, chengxx, ywl, zhihongzhu}@stu.pku.edu.cn; {mengcao, zouyx}@pku.edu.cn
Pseudocode | Yes | Algorithm 1: Non-Auxiliary Caption Suppression (NACS)
Open Source Code | No | The paper does not explicitly provide a link to open-source code or state that the code for the described methodology is publicly available.
Open Datasets | Yes | Extensive experiments on three public datasets (i.e., ActivityNet Captions, TACoS and ActivityNet-CG) demonstrate that our method significantly outperforms state-of-the-art methods. ActivityNet Captions (Krishna et al. 2017) contains 20,000 untrimmed videos and 100,000 descriptions from YouTube (Caba Heilbron et al. 2015). TACoS (Regneri et al. 2013) contains 127 videos from cooking scenarios. ActivityNet-CG (Li et al. 2022) evaluates how well a model generalizes to query sentences that contain novel compositions or novel words.
Dataset Splits | Yes | ActivityNet Captions: Following the public split, we use 37,417, 17,505 and 17,031 sentence-video pairs for training, validation and testing, respectively. TACoS: We follow the standard split (Gao et al. 2017), with 10,146, 4,589 and 4,083 video-query pairs for training, validation and testing, respectively. ActivityNet-CG: A new split of ActivityNet Captions into four sets: training, novel-composition, novel-word, and test-trivial. (These splits are summarized in the sketch after this table.)
Hardware Specification | No | The paper does not explicitly provide specific hardware details such as the GPU or CPU models used for running the experiments.
Software Dependencies | No | The paper mentions using the AdamW optimizer and pre-trained BERT but does not provide specific version numbers for these or other software dependencies required for replication.
Experiment Setup | Yes | During training, we used the AdamW optimizer (Loshchilov and Hutter 2018) with a learning rate of 8 × 10⁻⁴. The batch size B was set to 48 and 8 for ActivityNet Captions and TACoS, respectively, and ActivityNet-CG used the same settings as ActivityNet Captions. For the input video, we used exactly the same settings as previous work (Wang et al. 2022b) for a fair comparison, including visual features (C3D for both), NMS thresholds (0.5, 0.4), number of sampled clips (64, 128), number of 2D convolution network layers (3, 4) and kernels (4, 2) for ActivityNet Captions and TACoS, respectively. (See the configuration sketch after this table.)
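
For quick reference, the split sizes quoted in the Dataset Splits row can be captured in a small configuration dictionary. This is a minimal editorial sketch; the dictionary name and layout are our own and do not appear in the paper:

```python
# Illustrative summary of the dataset splits quoted above.
# The dictionary name and layout are editorial assumptions for
# readability; they are not structures defined in the paper.
DATASET_SPLITS = {
    "ActivityNet Captions": {   # sentence-video pairs (public split)
        "train": 37417, "val": 17505, "test": 17031,
    },
    "TACoS": {                  # video-query pairs (Gao et al. 2017 split)
        "train": 10146, "val": 4589, "test": 4083,
    },
    "ActivityNet-CG": {         # re-split of ActivityNet Captions; sizes not quoted
        "sets": ["training", "novel-composition", "novel-word", "test-trivial"],
    },
}

if __name__ == "__main__":
    for dataset, split in DATASET_SPLITS.items():
        print(dataset, split)
```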
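
The Experiment Setup row likewise reduces to a handful of hyperparameters. Below is a minimal sketch assuming a PyTorch implementation; `CONFIGS` and `build_optimizer` are hypothetical names, the model is left abstract, and only the numeric values come from the quoted setup, so this records the reported configuration rather than the authors' actual code:

```python
import torch

# Per-dataset hyperparameters quoted in the Experiment Setup row.
# All names here (CONFIGS, build_optimizer) are hypothetical;
# ActivityNet-CG reuses the ActivityNet Captions settings.
CONFIGS = {
    "ActivityNet Captions": dict(batch_size=48, nms_threshold=0.5,
                                 num_clips=64, conv_layers=3, kernel=4),
    "TACoS": dict(batch_size=8, nms_threshold=0.4,
                  num_clips=128, conv_layers=4, kernel=2),
}

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    # AdamW (Loshchilov and Hutter 2018) with the reported learning rate.
    return torch.optim.AdamW(model.parameters(), lr=8e-4)
```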