Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning

Authors: Jingkuan Song, Lianli Gao, Zhao Guo, Wu Liu, Dongxiang Zhang, Heng Tao Shen

IJCAI 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility assessment: each variable below is listed with its result and the supporting LLM response.

Research Type: Experimental
LLM Response: To demonstrate the effectiveness of our proposed framework, we test our method on two prevalent datasets, MSVD and MSR-VTT; experimental results show that our approach outperforms the state-of-the-art methods on both datasets.

Researcher Affiliation: Academia
LLM Response: (1) Center for Future Media and School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China; (2) Beijing University of Posts and Telecommunications, Beijing 100876, China. Emails: jingkuan.song@gmail.com, {zhao.guo, lianli.gao, zhangdo}@uestc.edu.cn, liuwu@bupt.edu.cn, shenhengtao@hotmail.com

Pseudocode: No
LLM Response: The paper describes the model architecture and equations (e.g., Eq. 3, 4, 5, 6, 11, 12) but does not present any structured pseudocode or algorithm blocks.

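Although no pseudocode is given, the paper's adjusted temporal attention can be sketched compactly. The PyTorch snippet below is a minimal sketch of our reading of the mechanism: a softmax over frame features yields a visual context, and a sigmoid gate (beta) blends it with the language LSTM's hidden state. The class name, feature dimension, and gate parameterization are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjustedTemporalAttention(nn.Module):
    """Soft temporal attention with an adjustment gate (our paraphrase
    of the paper's equations, not the authors' implementation)."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)   # project frame features
        self.hid_proj = nn.Linear(hidden_dim, hidden_dim)  # project decoder state
        self.score = nn.Linear(hidden_dim, 1)              # attention energies
        self.gate = nn.Linear(hidden_dim, 1)               # adjustment gate (beta)

    def forward(self, feats, hidden):
        # feats: (batch, n_frames, feat_dim); hidden: (batch, hidden_dim)
        proj = self.feat_proj(feats)                                    # (B, T, H)
        e = self.score(torch.tanh(proj + self.hid_proj(hidden).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)                                     # temporal weights (B, T, 1)
        phi = (alpha * proj).sum(dim=1)                                 # visual context (B, H)
        beta = torch.sigmoid(self.gate(hidden))                         # gate in [0, 1], (B, 1)
        return beta * phi + (1.0 - beta) * hidden                       # adjusted context

# Example: 4 videos, 28 frames each, 2048-d CNN features, 512-d LSTM state.
ctx = AdjustedTemporalAttention()(torch.randn(4, 28, 2048), torch.randn(4, 512))
assert ctx.shape == (4, 512)
```
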
Open Source Code: No
LLM Response: The paper makes no statement about providing open-source code for the described methodology.

Open Datasets: Yes
LLM Response: We consider two publicly available datasets that have been widely used in previous work. The Microsoft Video Description Corpus (MSVD): this video corpus consists of 1,970 short video clips, approximately 80,000 description pairs, and about 16,000 vocabulary words [Chen and Dolan, 2011]. ... MSR Video to Text (MSR-VTT): in 2016, Xu et al. [Xu et al., 2016] proposed what is currently the largest video benchmark for video understanding, and especially for video captioning.

Dataset Splits: Yes
LLM Response: Following [Yao et al., 2015; Venugopalan et al., 2015], we split the dataset into training, validation, and testing sets with 1,200, 100, and 670 videos, respectively.

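Since the split sizes are stated exactly, the MSVD partition is easy to reproduce. A minimal sketch, assuming clips are indexed 0-1969 in the corpus's conventional order (the paper does not state the ordering, so that indexing is an assumption):

```python
# Standard MSVD split per [Yao et al., 2015; Venugopalan et al., 2015]:
# 1,200 train / 100 validation / 670 test videos (1,970 total).
video_ids = list(range(1970))          # assumed 0-based corpus order
train_ids = video_ids[:1200]
val_ids   = video_ids[1200:1300]
test_ids  = video_ids[1300:]
assert (len(train_ids), len(val_ids), len(test_ids)) == (1200, 100, 670)
```
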
Hardware Specification: No
LLM Response: No specific hardware details (e.g., GPU models, CPU models, memory) are mentioned for running the experiments.

Software Dependencies: No
LLM Response: No specific version numbers for software components or other dependencies are provided. The paper mentions: "we convert all descriptions to lower cases, and then use wordpunct tokenizer method from NLTK toolbox to tokenize sentences and remove punctuations" and "We adopt adadelta [Zeiler, 2012], which is an adaptive learning rate approach, to optimize our loss function."

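The tokenization step maps directly onto NLTK's wordpunct tokenizer, which is regex-based and needs no downloaded models. A minimal sketch of that preprocessing, assuming any recent NLTK release (the paper names no version); the preprocess helper is ours:

```python
from nltk.tokenize import wordpunct_tokenize

def preprocess(description):
    """Lowercase, tokenize with wordpunct, and drop pure-punctuation tokens."""
    tokens = wordpunct_tokenize(description.lower())
    return [t for t in tokens if any(c.isalnum() for c in t)]

print(preprocess("A man is playing a guitar."))
# ['a', 'man', 'is', 'playing', 'a', 'guitar']
```
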
Experiment Setup: Yes
LLM Response: All the LSTM unit sizes are set to 512 and the word embedding size is set to 512, empirically. Our objective function (Eq. 8) is optimized over the whole set of training video-sentence pairs with a mini-batch size of 64 for MSVD and 256 for MSR-VTT. We adopt adadelta [Zeiler, 2012], which is an adaptive learning rate approach, to optimize our loss function. In addition, we utilize dropout regularization with a rate of 0.5 in all layers and clip gradients element-wise at 10. We train for up to 500 epochs, stopping early if the evaluation metric does not improve on the validation set within a patience of 20.

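These hyperparameters translate into a conventional training loop. The sketch below uses PyTorch's Adadelta and element-wise gradient clipping; model, compute_loss, and evaluate_metric are hypothetical stand-ins, since no code accompanies the paper, and the 0.5 dropout is assumed to live inside the model's layers.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, compute_loss, evaluate_metric,
          max_epochs=500, patience=20, clip_value=10.0):
    """Training loop using the hyperparameters reported in the paper."""
    optimizer = torch.optim.Adadelta(model.parameters())  # adaptive learning rate
    best_metric, stale = float('-inf'), 0
    for epoch in range(max_epochs):
        model.train()
        for batch in train_loader:                 # mini-batch: 64 (MSVD) / 256 (MSR-VTT)
            optimizer.zero_grad()
            loss = compute_loss(model, batch)      # Eq. 8 objective (hypothetical helper)
            loss.backward()
            nn.utils.clip_grad_value_(model.parameters(), clip_value)  # element-wise clip at 10
            optimizer.step()
        metric = evaluate_metric(model, val_loader)
        if metric > best_metric:
            best_metric, stale = metric, 0
        else:
            stale += 1
            if stale >= patience:                  # early stopping at patience 20
                break
```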