Comprehensive Visual Grounding for Video Description
Authors: Wenhui Jiang, Yibo Cheng, Linxin Liu, Yuming Fang, Yuxin Peng, Yang Liu
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on two challenging datasets, and demonstrate significant performance improvements of +2.3 CIDEr on ActivityNet-Entities and +2.2 CIDEr on MSR-VTT compared to state-of-the-art methods. |
| Researcher Affiliation | Collaboration | (1) Jiangxi University of Finance and Economics, Nanchang, China; (2) Peking University, Beijing, China; (3) Sany Heavy Industry Co., Ltd., China |
| Pseudocode | No | The paper does not contain explicit pseudocode blocks or algorithm listings, only diagrams and descriptive text. |
| Open Source Code | No | The paper does not provide any statement or link regarding the public availability of its source code. |
| Open Datasets | Yes | We conduct our experiments on ActivityNet-Entities and MSR-VTT. ActivityNet-Entities provides not only caption annotations for the videos but also bounding-box annotations for the noun phrases in the captions. It contains 15,000 videos comprising 52,000 video segments, with 1 caption annotation per segment, and a total of 158,000 valid box annotations across 432 classes. MSR-VTT contains 10,000 video clips from YouTube, with 20 human descriptions per clip; it provides 6,573 samples for training, 497 for validation, and 2,990 for testing. |
| Dataset Splits | Yes | MSR-VTT contains 6,573 samples for training, 497 samples for validation, and 2,990 for testing. |
| Hardware Specification | No | The paper does not specify the hardware used for experiments, such as specific GPU models or CPU types. |
| Software Dependencies | No | The paper mentions models like Video Swin Transformer and Text4Vis, but does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For ActivityNet-Entities, we uniformly sample 10 frames for each video segment. For MSR-VTT, 32 video frames are sampled from the video clip. The word embedding size is set to 512. Empirically, λ is set to 0.1. During training, we optimize the model with Adam for 25 epochs. The learning rate is initialized to 5e-4 and decayed by a factor of 0.8 every three epochs. |
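
The training recipe in the Experiment Setup row translates into a short, self-contained sketch. This is a minimal illustration under stated assumptions, not the authors' released code: the framework (PyTorch), the toy model, the loss functions, and the synthetic batch are hypothetical stand-ins; only the 512-dimensional embedding, λ = 0.1, Adam, 25 epochs, the initial learning rate of 5e-4, and the decay factor of 0.8 every three epochs come from the paper. Frame sampling (10 frames per ActivityNet-Entities segment, 32 per MSR-VTT clip) would happen in the data pipeline and is not shown.

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import StepLR

# Placeholder model standing in for the grounded video captioning network;
# the real architecture is not reproduced here.
class ToyGroundedCaptioner(nn.Module):
    def __init__(self, feat_dim=512, vocab_size=1000):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, 512)         # word embedding size 512 (paper)
        self.caption_head = nn.Linear(512, vocab_size)  # stand-in captioning head
        self.grounding_head = nn.Linear(512, 4)         # stand-in box-regression head

    def forward(self, x):
        h = torch.relu(self.encoder(x))
        return self.caption_head(h), self.grounding_head(h)

model = ToyGroundedCaptioner()
caption_loss_fn = nn.CrossEntropyLoss()  # placeholder for the captioning loss
grounding_loss_fn = nn.SmoothL1Loss()    # placeholder for the grounding loss
lambda_ground = 0.1                      # λ = 0.1, set empirically per the paper

# Adam for 25 epochs, lr initialized to 5e-4 and decayed by 0.8 every three epochs.
optimizer = optim.Adam(model.parameters(), lr=5e-4)
scheduler = StepLR(optimizer, step_size=3, gamma=0.8)

for epoch in range(25):
    # Synthetic batch so the sketch runs end to end; real training would iterate
    # over ActivityNet-Entities or MSR-VTT features instead.
    feats = torch.randn(8, 512)
    caption_targets = torch.randint(0, 1000, (8,))
    box_targets = torch.randn(8, 4)

    optimizer.zero_grad()
    caption_logits, grounding_pred = model(feats)
    loss = caption_loss_fn(caption_logits, caption_targets) \
         + lambda_ground * grounding_loss_fn(grounding_pred, box_targets)
    loss.backward()
    optimizer.step()
    scheduler.step()
```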