Video Captioning with Listwise Supervision

Authors: Yuan Liu, Xue Li, Zhongchao Shi

AAAI 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments on MSVD dataset show that our proposed LSTM-LS produces better performance than the state of the art in generating natural sentences: 51.1% and 32.6% in terms of BLEU@4 and METEOR, respectively. Superior performances are also reported on the movie description M-VAD dataset.
Researcher Affiliation | Industry | Yuan Liu, Xue Li, Zhongchao Shi; Ricoh Software Research Center (Beijing) Co., Ltd., Beijing, China; Ricoh Company, Ltd., Yokohama, Japan; yuanliu.ustc@gmail.com, Setsu.Ri@nts.ricoh.co.jp, Zhongchao.Shi@srcb.ricoh.com
Pseudocode | No | The paper describes the model's equations and procedures but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper does not provide any explicit statements about open-sourcing the code or links to a code repository.
Open Datasets | Yes | We evaluate and compare our proposed LSTM-LS with state-of-the-art approaches by conducting video captioning on two benchmarks, i.e., Microsoft Research Video Description Corpus (MSVD) (Chen and Dolan 2011) and Montreal Video Annotation Dataset (M-VAD) (Torabi et al. 2015).
Dataset Splits | Yes | In our experiments, we follow the setting used in prior works (Guadarrama et al. 2013; Pan et al. 2016a), taking 1,200 videos for training, 100 for validation and 670 for testing.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions models and architectures like VGG and C3D but does not specify software dependencies (e.g., libraries, frameworks) with version numbers.
Experiment Setup | Yes | In the experiment, we compare our LSTM-LS approach with one 2-D CNN of 19-layer VGG (Simonyan and Zisserman 2015) network pre-trained on Imagenet ILSVRC12 dataset (Russakovsky et al. 2015), and one 3-D CNN of C3D (Tran et al. 2015) pre-trained on Sports1M video dataset (Karpathy et al. 2014). Specifically, we take the output of 4096-way fc6 layer from the 19-layer VGG and 4096-way fc6 layer from C3D as the frame and clip representation, respectively. The size of hidden layer in LSTM is set to 1,024. The number of nearest sentences K is empirically set to 4.
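
The dataset-split row above amounts to a fixed partition of the 1,970 MSVD video IDs into 1,200 / 100 / 670. A minimal sketch of that partition is given below, assuming the conventional ordering of the corpus; the helper name split_msvd is hypothetical and not taken from the authors' (unreleased) code.

from typing import List, Tuple

# Split sizes reported in the paper: 1,200 train / 100 validation / 670 test.
MSVD_SPLIT_SIZES = {"train": 1200, "val": 100, "test": 670}

def split_msvd(video_ids: List[str]) -> Tuple[List[str], List[str], List[str]]:
    """Partition the ordered MSVD video IDs into the standard splits.

    Assumes `video_ids` follows the conventional ordering of the 1,970 MSVD
    videos used in prior work (Guadarrama et al. 2013; Pan et al. 2016a).
    """
    n_train, n_val = MSVD_SPLIT_SIZES["train"], MSVD_SPLIT_SIZES["val"]
    assert len(video_ids) == sum(MSVD_SPLIT_SIZES.values())
    train = video_ids[:n_train]
    val = video_ids[n_train:n_train + n_val]
    test = video_ids[n_train + n_val:]
    return train, val, test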
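
Likewise, the experiment-setup row reduces to a handful of hyperparameters. The sketch below collects them into an illustrative configuration object; the class name LSTMLSConfig and its field names are hypothetical, since the paper does not release code.

from dataclasses import dataclass

@dataclass(frozen=True)
class LSTMLSConfig:
    """Hyperparameters quoted in the experiment-setup row above (names are illustrative)."""
    frame_feature_dim: int = 4096   # 4096-way fc6 output of the 19-layer VGG (frame representation)
    clip_feature_dim: int = 4096    # 4096-way fc6 output of C3D (clip representation)
    lstm_hidden_size: int = 1024    # size of the hidden layer in the LSTM
    k_nearest_sentences: int = 4    # number of nearest sentences K, set empirically

# Example usage:
cfg = LSTMLSConfig()
print(cfg.lstm_hidden_size, cfg.k_nearest_sentences)  # 1024 4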