Motion Guided Spatial Attention for Video Captioning
Authors: Shaoxiang Chen, Yu-Gang Jiang (pp. 8191-8198)
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on two benchmark datasets, MSVD and MSR-VTT. The experiments show that our designed model can generate better video representation and state of the art results are obtained under popular evaluation metrics such as BLEU@4, CIDEr, and METEOR. |
| Researcher Affiliation | Academia | Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University Shanghai Institute of Intelligent Electronics & Systems {sxchen13, ygj}@fudan.edu.cn |
| Pseudocode | No | The paper describes the architecture and computations in text and diagrams (Figure 2) but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | "All the components of our model and training are implemented in Tensorflow" (footnote: https://github.com/tensorflow/tensorflow). The cited link points to the TensorFlow library itself, not to an implementation of this paper. The paper does not state that the authors' code is open source or available. |
| Open Datasets | Yes | The MSVD dataset (Chen and Dolan 2011) is a widely used benchmark dataset for video captioning methods. The MSR-VTT dataset (Xu et al. 2016) is a large scale open-domain video captioning dataset. |
| Dataset Splits | Yes | For MSVD: "In our experiments, we follow the split settings in prior works (Xu et al. 2017; Yao et al. 2015): 1,200 videos for training, 100 videos for validation and 670 videos for testing." For MSR-VTT: "We follow the standard dataset split in the dataset paper: 6,513 videos for training, 497 videos for validation and 2,990 videos for testing." |
| Hardware Specification | Yes | On a commodity GTX 1080 Ti GPU, the times needed to extract frame features and optical flows for a typical 10-second video clip are 400ms and 800ms, respectively. |
| Software Dependencies | No | "All the components of our model and training are implemented in Tensorflow." The paper names TensorFlow but does not specify its version or any other software dependencies with version numbers. |
| Experiment Setup | Yes | The LSTMs used in our model all have 1024 hidden units and the word embedding size is set to 512. ... We apply dropout with rate of 0.5 to all the vertical connections of LSTMs and L2 regularization with a factor of 5 × 10⁻⁵ to all the trainable parameters to mitigate overfitting. We apply the ADAM optimizer with a learning rate of 10⁻⁴ and batch size of 32 to minimize the negative log-likelihood loss. |
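For readers attempting a reimplementation, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The dictionary keys below are illustrative names chosen here, not identifiers from the authors' (unreleased) code; the values are those reported in the paper.

```python
# Hyperparameters as reported in the paper's experiment setup.
# Key names are hypothetical; only the values come from the paper.
train_config = {
    "lstm_hidden_units": 1024,          # all LSTMs in the model
    "word_embedding_size": 512,
    "dropout_rate": 0.5,                # vertical LSTM connections only
    "l2_factor": 5e-5,                  # applied to all trainable parameters
    "optimizer": "adam",
    "learning_rate": 1e-4,
    "batch_size": 32,
    "loss": "negative_log_likelihood",
}
```

A reimplementation would still need details the paper does not report, such as the TensorFlow version, gradient clipping, and the number of training epochs.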