Comprehensive Visual Grounding for Video Description
Authors: Wenhui Jiang, Yibo Cheng, Linxin Liu, Yuming Fang, Yuxin Peng, Yang Liu
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on two challenging datasets, and demonstrate significant performance improvements of +2.3 CIDEr on ActivityNet-Entities and +2.2 CIDEr on MSR-VTT compared to state-of-the-art methods. |
| Researcher Affiliation | Collaboration | (1) Jiangxi University of Finance and Economics, Nanchang, China; (2) Peking University, Beijing, China; (3) Sany Heavy Industry Co., Ltd., China |
| Pseudocode | No | The paper does not contain explicit pseudocode blocks or algorithm listings, only diagrams and descriptive text. |
| Open Source Code | No | The paper does not provide any statement or link regarding the public availability of its source code. |
| Open Datasets | Yes | We conduct our experiments on ActivityNet-Entities and MSR-VTT. ActivityNet-Entities provides not only caption annotations for the videos but also bounding-box annotations for the noun phrases in the captions. It contains 15,000 videos comprising 52,000 video segments, with 1 caption annotation per segment, and a total of 158,000 valid box annotations across 432 classes. MSR-VTT contains 10,000 video clips from YouTube, with 20 human descriptions per clip; it provides 6,573 samples for training, 497 for validation, and 2,990 for testing. |
| Dataset Splits | Yes | MSR-VTT contains 6,573 samples for training, 497 samples for validation, and 2,990 for testing. |
| Hardware Specification | No | The paper does not specify the hardware used for experiments, such as specific GPU models or CPU types. |
| Software Dependencies | No | The paper mentions models like Video Swin Transformer and Text4Vis, but does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For ActivityNet-Entities, we uniformly sample 10 frames for each video segment. For MSR-VTT, 32 video frames are sampled from the video clip. The word embedding size is set to 512. Empirically, λ is set to 0.1. During training, we optimize the model with Adam for 25 epochs. The learning rate is initialized to 5e-4 and decayed by a factor of 0.8 every three epochs. |
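
The training recipe in the Experiment Setup row translates into a short, self-contained sketch. This is a minimal illustration under stated assumptions, not the authors' released code: the framework (PyTorch), the toy model, the loss functions, and the synthetic batch are hypothetical stand-ins; only the 512-dimensional embedding, λ = 0.1, Adam, 25 epochs, the initial learning rate of 5e-4, and the decay factor of 0.8 every three epochs come from the paper. Frame sampling (10 frames per ActivityNet-Entities segment, 32 per MSR-VTT clip) would happen in the data pipeline and is not shown.

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import StepLR

# Placeholder model standing in for the grounded video captioning network;
# the real architecture is not reproduced here.
class ToyGroundedCaptioner(nn.Module):
    def __init__(self, feat_dim=512, vocab_size=1000):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, 512)         # word embedding size 512 (paper)
        self.caption_head = nn.Linear(512, vocab_size)  # stand-in captioning head
        self.grounding_head = nn.Linear(512, 4)         # stand-in box-regression head

    def forward(self, x):
        h = torch.relu(self.encoder(x))
        return self.caption_head(h), self.grounding_head(h)

model = ToyGroundedCaptioner()
caption_loss_fn = nn.CrossEntropyLoss()  # placeholder for the captioning loss
grounding_loss_fn = nn.SmoothL1Loss()    # placeholder for the grounding loss
lambda_ground = 0.1                      # λ = 0.1, set empirically per the paper

# Adam for 25 epochs, lr initialized to 5e-4 and decayed by 0.8 every three epochs.
optimizer = optim.Adam(model.parameters(), lr=5e-4)
scheduler = StepLR(optimizer, step_size=3, gamma=0.8)

for epoch in range(25):
    # Synthetic batch so the sketch runs end to end; real training would iterate
    # over ActivityNet-Entities or MSR-VTT features instead.
    feats = torch.randn(8, 512)
    caption_targets = torch.randint(0, 1000, (8,))
    box_targets = torch.randn(8, 4)

    optimizer.zero_grad()
    caption_logits, grounding_pred = model(feats)
    loss = caption_loss_fn(caption_logits, caption_targets) \
         + lambda_ground * grounding_loss_fn(grounding_pred, box_targets)
    loss.backward()
    optimizer.step()
    scheduler.step()
```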