Fully Convolutional Video Captioning with Coarse-to-Fine and Inherited Attention

Authors: Kuncheng Fang, Lian Zhou, Cheng Jin, Yuejie Zhang, Kangnian Weng, Tao Zhang, Weiguo Fan (pp. 8271-8278)

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments; Dataset and Evaluation Metrics; Experimental Settings; Experiment Results and Analyses
Researcher Affiliation | Academia | (1) School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Shanghai Institute of Intelligent Electronics & Systems, Fudan University, Shanghai, China; (2) School of Information Management and Engineering, Shanghai Key Laboratory of Financial Information Technology, Shanghai University of Finance and Economics, Shanghai, China; (3) Department of Management Sciences, Tippie College of Business, University of Iowa, Iowa City, Iowa, USA, 52242
Pseudocode | No | The paper describes its methods using mathematical equations and textual explanations, but it does not include a dedicated pseudocode or algorithm block.
Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | Yes | The Microsoft Video Description Corpus (MSVD) is the most popular benchmark dataset for video captioning (Guadarrama et al. 2013).
Dataset Splits | Yes | For fair comparison, we follow the commonly utilized setting in our experiments, i.e., 1,200 videos for training, 100 videos for validation and 670 videos for testing.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies | No | The paper mentions using Inception-V3 and C3D models but does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | We sample 15 frames for each video. Inception-V3 (Szegedy et al. 2016) and C3D (Ji et al. 2013) are used to extract features for video representation. The input images are resized to 299 × 299, and thus the dimension of frame features is 2,048. ... The kernel size of 1-D CNN is 5, and the number of stacked layers is 4. We use temporal attention in the first two layers and inherited attention in the last two layers. The head of multi-head attention is 8. The dimension of word-embedding, global feature, frame feature, and region feature are all 512.
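For illustration, below is a minimal PyTorch sketch of a fully convolutional decoder configured with the hyperparameters quoted in the Experiment Setup row (15 sampled frames, 1-D convolutions with kernel size 5, 4 stacked layers, 8-head attention, 512-dimensional features). It is not the authors' implementation: the class name, toy vocabulary size, and residual fusion of the attention output are assumptions for readability, the paper's temporal and inherited attention modules are replaced by standard multi-head attention, and causal masking of the convolutions is omitted.

```python
# Hypothetical sketch of the decoder configuration described above; the
# coarse-to-fine / inherited attention modules are simplified to standard
# multi-head attention, and convolutions are not causally masked.
import torch
import torch.nn as nn

class ConvCaptionDecoder(nn.Module):
    def __init__(self, vocab_size, dim=512, kernel_size=5, num_layers=4, num_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Stacked 1-D convolutions over the word sequence (kernel size 5, 4 layers).
        self.convs = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
            for _ in range(num_layers)
        ])
        # One attention module per layer; the paper uses temporal attention in the
        # first two layers and inherited attention in the last two, both stood in
        # here by 8-head multi-head attention over the frame features.
        self.attns = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        ])
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, words, frame_feats):
        # words: (batch, seq_len) token ids; frame_feats: (batch, 15, dim)
        x = self.embed(words)                              # (batch, seq_len, dim)
        for conv, attn in zip(self.convs, self.attns):
            h = conv(x.transpose(1, 2)).transpose(1, 2)    # 1-D conv over time steps
            ctx, _ = attn(h, frame_feats, frame_feats)     # attend to frame features
            x = h + ctx                                    # residual fusion (assumed)
        return self.out(x)                                 # per-step vocabulary logits

# Usage example: 15 sampled frames with 512-d features, toy vocabulary of 1,000 words.
decoder = ConvCaptionDecoder(vocab_size=1000)
logits = decoder(torch.randint(0, 1000, (2, 12)), torch.randn(2, 15, 512))
print(logits.shape)  # torch.Size([2, 12, 1000])
```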