Fully Convolutional Video Captioning with Coarse-to-Fine and Inherited Attention
Authors: Kuncheng Fang, Lian Zhou, Cheng Jin, Yuejie Zhang, Kangnian Weng, Tao Zhang, Weiguo Fan (pp. 8271-8278)
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments; Dataset and Evaluation Metrics; Experimental Settings; Experiment Results and Analyses |
| Researcher Affiliation | Academia | 1School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Shanghai Institute of Intelligent Electronics & Systems, Fudan University, Shanghai, China; 2School of Information Management and Engineering, Shanghai Key Laboratory of Financial Information Technology, Shanghai University of Finance and Economics, Shanghai, China; 3Department of Management Sciences, Tippie College of Business, University of Iowa, Iowa City, Iowa, USA, 52242 |
| Pseudocode | No | The paper describes its methods using mathematical equations and textual explanations, but it does not include a dedicated pseudocode or algorithm block. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | The Microsoft Video Description Corpus (MSVD) is the most popular benchmark dataset for video captioning (Guadarrama et al. 2013). |
| Dataset Splits | Yes | For fair comparison, we follow the commonly utilized setting in our experiments, i.e., 1,200 videos for training, 100 videos for validation and 670 videos for testing. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions using Inception-V3 and C3D models but does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | We sample 15 frames for each video. Inception-V3 (Szegedy et al. 2016) and C3D (Ji et al. 2013) are used to extract features for video representation. The input images are resized to 299 × 299, and thus the dimension of frame features is 2,048. ... The kernel size of 1-D CNN is 5, and the number of stacked layers is 4. We use temporal attention in the first two layers and inherited attention in the last two layers. The head of multi-head attention is 8. The dimension of word-embedding, global feature, frame feature, and region feature are all 512. |
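
The Dataset Splits and Experiment Setup rows above pin down most of the reported hyperparameters for the MSVD experiments. As a rough illustration only, the following Python sketch collects those values into a hypothetical configuration object; the class and field names (`FCVCConfig`, `frames_per_video`, and so on) are our own invention and are not taken from the paper or from any released code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FCVCConfig:
    """Hypothetical configuration mirroring the setup quoted above (MSVD)."""

    # Dataset split reported in the paper for MSVD.
    train_videos: int = 1200
    val_videos: int = 100
    test_videos: int = 670

    # Video sampling and feature extraction.
    frames_per_video: int = 15                 # "We sample 15 frames for each video."
    frame_size: Tuple[int, int] = (299, 299)   # Inception-V3 input resolution
    frame_feature_dim: int = 2048              # reported frame feature dimension

    # Fully convolutional decoder.
    cnn_kernel_size: int = 5                            # 1-D CNN kernel size
    num_cnn_layers: int = 4                             # stacked 1-D CNN layers
    temporal_attention_layers: Tuple[int, ...] = (0, 1)   # first two layers
    inherited_attention_layers: Tuple[int, ...] = (2, 3)  # last two layers
    num_attention_heads: int = 8

    # Shared dimension for word embedding, global, frame, and region features.
    model_dim: int = 512


if __name__ == "__main__":
    cfg = FCVCConfig()
    # MSVD contains 1,970 clips in total; the quoted split covers all of them.
    assert cfg.train_videos + cfg.val_videos + cfg.test_videos == 1970
    print(cfg)
```

The sketch only records hyperparameters; it does not reproduce the model itself, since the paper provides no code and leaves details such as optimizer settings and feature pooling unspecified.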