Exploiting Auxiliary Caption for Video Grounding
Authors: Hongxiang Li, Meng Cao, Xuxin Cheng, Yaowei Li, Zhihong Zhu, Yuexian Zou
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on three public datasets (i.e., ActivityNet Captions, TACoS and ActivityNet-CG) demonstrate that our method significantly outperforms state-of-the-art methods. |
| Researcher Affiliation | Academia | Hongxiang Li¹, Meng Cao², Xuxin Cheng¹, Yaowei Li¹, Zhihong Zhu¹, Yuexian Zou¹. ¹School of Electronic and Computer Engineering, Peking University; ²International Digital Economy Academy (IDEA). {lihongxiang, chengxx, ywl, zhihongzhu}@stu.pku.edu.cn; {mengcao, zouyx}@pku.edu.cn |
| Pseudocode | Yes | Algorithm 1: Non-Auxiliary Caption Suppression (NACS) |
| Open Source Code | No | The paper does not explicitly provide a link to open-source code or state that the code for the described methodology is publicly available. |
| Open Datasets | Yes | Extensive experiments on three public datasets (i.e., ActivityNet Captions, TACoS and ActivityNet-CG) demonstrate that our method significantly outperforms state-of-the-art methods. ActivityNet Captions (Krishna et al. 2017) contains 20,000 untrimmed videos and 100,000 descriptions from YouTube (Caba Heilbron et al. 2015). TACoS (Regneri et al. 2013) contains 127 videos from cooking scenarios. ActivityNet-CG (Li et al. 2022) aims to evaluate how well a model can generalize to query sentences that contain novel compositions or novel words. |
| Dataset Splits | Yes | ActivityNet Captions: Following the public split, we use 37417, 17505 and 17031 sentence-video pairs for training, validation and testing, respectively. TACoS: We follow the standard split (Gao et al. 2017), which has 10146, 4589 and 4083 video-query pairs for training, validation and testing, respectively. ActivityNet-CG: It is a new split of ActivityNet Captions, which is re-split into four sets: training, novel-composition, novel-word, and test-trivial. (These split sizes are collected in the config sketch after the table.) |
| Hardware Specification | No | The paper does not explicitly provide specific hardware details such as GPU or CPU models used for running the experiments. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and pre-trained BERT but does not provide specific version numbers for these or other software dependencies required for replication. |
| Experiment Setup | Yes | During training, we used the AdamW (Loshchilov and Hutter 2018) optimizer to train our model with a learning rate of 8 × 10⁻⁴. The batch size B was set to 48 and 8 for ActivityNet Captions and TACoS, respectively. We employed the same settings as ActivityNet Captions on ActivityNet-CG. For the input video, we used exactly the same settings as in the previous work (Wang et al. 2022b) for a fair comparison, including visual features (both C3D features), NMS thresholds (0.5, 0.4), number of sampled clips (64, 128), number of 2D convolution network layers (3, 4) and kernels (4, 2) for ActivityNet Captions and TACoS, respectively. (A hedged PyTorch sketch of this setup follows the table.) |
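
The split counts quoted in the Dataset Splits row can be pinned down in a small config for replication. The snippet below is a minimal Python sketch; the dictionary layout and key names are our own convention, not the authors' code, and the per-set sizes of ActivityNet-CG are not quoted in this report, so only the set names are recorded.

```python
# Pair counts quoted in the Dataset Splits row above. The layout and
# key names are illustrative assumptions, not the authors' code.
DATASET_SPLITS = {
    "ActivityNet Captions": {"train": 37417, "val": 17505, "test": 17031},
    "TACoS": {"train": 10146, "val": 4589, "test": 4083},
    # ActivityNet-CG re-splits ActivityNet Captions into four sets;
    # the per-set sizes are not quoted in this report.
    "ActivityNet-CG": ["training", "novel-composition", "novel-word", "test-trivial"],
}

if __name__ == "__main__":
    for dataset, splits in DATASET_SPLITS.items():
        print(dataset, "->", splits)
```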
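
The Experiment Setup row fixes the optimizer, learning rate, and the per-dataset pipeline parameters. The sketch below wires those reported values into a PyTorch-style training configuration; it assumes PyTorch's stock `torch.optim.AdamW`, and anything the row does not quote (weight decay, epoch count, the grounding model itself) is left as a placeholder rather than a claim about the paper.

```python
import torch
from torch.optim import AdamW

# Per-dataset settings quoted in the Experiment Setup row. Field names
# are our own; only the numeric values come from the paper.
CONFIG = {
    "ActivityNet Captions": {
        "batch_size": 48,
        "nms_threshold": 0.5,
        "num_sampled_clips": 64,
        "conv2d_layers": 3,
        "kernel_size": 4,
    },
    "TACoS": {
        "batch_size": 8,
        "nms_threshold": 0.4,
        "num_sampled_clips": 128,
        "conv2d_layers": 4,
        "kernel_size": 2,
    },
}
# ActivityNet-CG reuses the ActivityNet Captions settings.
CONFIG["ActivityNet-CG"] = CONFIG["ActivityNet Captions"]

LEARNING_RATE = 8e-4  # "learning rate of 8 × 10⁻⁴"


def build_optimizer(model: torch.nn.Module) -> AdamW:
    """AdamW as reported; weight decay is left at the PyTorch default
    because the paper does not quote one (an assumption)."""
    return AdamW(model.parameters(), lr=LEARNING_RATE)
```

Since the Hardware Specification and Software Dependencies rows are both "No", replicators would still need to fix library versions and compute on their own; the values above only capture what the paper states.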