Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning
Authors: Jingkuan Song, Lianli Gao, Zhao Guo, Wu Liu, Dongxiang Zhang, Heng Tao Shen
IJCAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the effectiveness of our proposed framework, we test our method on two prevalent datasets: MSVD and MSR-VTT, and experimental results show that our approach outperforms the state-of-the-art methods on both datasets. |
| Researcher Affiliation | Academia | 1Center for Future Media and School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China. 2Beijing University of Posts and Telecommunications, Beijing 100876, China. jingkuan.song@gmail.com, {zhao.guo, lianli.gao, zhangdo}@uestc.edu.cn, liuwu@bupt.edu.cn, shenhengtao@hotmail.com |
| Pseudocode | No | The paper describes the model architecture and equations (e.g., Eq. 3, 4, 5, 6, 11, 12) but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | No statement about providing open-source code for the methodology described in the paper. |
| Open Datasets | Yes | We consider two publicly available datasets that have been widely used in previous work. The Microsoft Video Description Corpus (MSVD). This video corpus consists of 1,970 short video clips, approximately 80,000 description pairs and about 16,000 vocabulary words [Chen and Dolan, 2011]. ... MSR Video to Text (MSR-VTT). In 2016, Xu et al. [Xu et al., 2016] proposed the currently largest video benchmark for video understanding, and especially for video captioning. |
| Dataset Splits | Yes | Following [Yao et al., 2015; Venugopalan et al., 2015], we split the dataset into training, validation and testing set with 1,200, 100 and 670 videos, respectively. |
| Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU models, memory) are mentioned for running the experiments. |
| Software Dependencies | No | No specific version numbers for software components or other dependencies are provided. The paper mentions: "we convert all descriptions to lower cases, and then use wordpunct tokenizer method from NLTK toolbox to tokenize sentences and remove punctuations." and "We adopt adadelta [Zeiler, 2012], which is an adaptive learning rate approach, to optimize our loss function." |
| Experiment Setup | Yes | In addition, all the LSTM unit sizes are set as 512 and the word embedding size is set as 512, empirically. Our objective function Eq. 8 is optimized over the whole training video-sentence pairs with a mini-batch size of 64 for MSVD and 256 for MSR-VTT. We adopt adadelta [Zeiler, 2012], which is an adaptive learning rate approach, to optimize our loss function. In addition, we utilize dropout regularization with the rate of 0.5 in all layers and clip gradients element-wise at 10. We stop training our model when 500 epochs are reached, or when the evaluation metric does not improve on the validation set within a patience of 20. |
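
The preprocessing quoted under Software Dependencies (lowercasing, NLTK wordpunct tokenization, punctuation removal) can be sketched as below. NLTK's wordpunct tokenizer is a regexp tokenizer over the pattern `\w+|[^\w\s]+`, which we reproduce here with the standard library; the helper name `preprocess_caption` is ours, not from the paper.

```python
import re

# NLTK's wordpunct tokenizer splits on the regex \w+|[^\w\s]+;
# we use the same pattern via the standard library to avoid the dependency.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

def preprocess_caption(sentence: str) -> list:
    """Lowercase the sentence, tokenize, and drop pure-punctuation tokens."""
    tokens = WORDPUNCT.findall(sentence.lower())
    # keep only tokens containing at least one alphanumeric character
    return [t for t in tokens if any(c.isalnum() for c in t)]

print(preprocess_caption("A man is riding a horse."))
# ['a', 'man', 'is', 'riding', 'a', 'horse']
```

Building the roughly 16,000-word MSVD vocabulary would then amount to collecting these tokens over all training captions.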
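
Two pieces of the quoted setup that are easy to get wrong when reproducing the paper are the element-wise gradient clipping at 10 and the patience-of-20 early stopping. A minimal sketch, with illustrative names (`clip_gradients`, `EarlyStopper`) that are ours rather than the authors':

```python
def clip_gradients(grads, threshold=10.0):
    """Clip each gradient component element-wise to [-threshold, threshold]
    (this is per-element clipping, not global-norm clipping)."""
    return [max(-threshold, min(threshold, g)) for g in grads]

class EarlyStopper:
    """Stop training once the validation metric has failed to improve
    for `patience` consecutive epochs, as described in the setup."""
    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True => stop training

print(clip_gradients([12.0, -15.0, 3.0]))
# [10.0, -10.0, 3.0]
```

The outer loop would run for at most 500 epochs, checking `EarlyStopper.step` on the validation metric after each one; adadelta supplies the per-parameter learning rates, so no base learning rate is quoted in the paper.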