reproducibilityindex.ai

Video Captioning with Listwise Supervision

Authors: Yuan Liu, Xue Li, Zhongchao Shi

AAAI 2017

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	The experiments on MSVD dataset show that our proposed LSTM-LS produces better performance than the state of the art in generating natural sentences: 51.1% and 32.6% in terms of BLEU@4 and METEOR, respectively. Superior performances are also reported on the movie description M-VAD dataset.
Researcher Affiliation	Industry	Yuan Liu, Xue Li, Zhongchao Shi; Ricoh Software Research Center (Beijing) Co., Ltd., Beijing, China; Ricoh Company, Ltd., Yokohama, Japan; yuanliu.ustc@gmail.com, Setsu.Ri@nts.ricoh.co.jp, Zhongchao.Shi@srcb.ricoh.com
Pseudocode	No	The paper describes the model's equations and procedures but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code	No	The paper does not provide any explicit statements about open-sourcing the code or links to a code repository.
Open Datasets	Yes	We evaluate and compare our proposed LSTM-LS with state-of-the-art approaches by conducting video captioning on two benchmarks, i.e., Microsoft Research Video Description Corpus (MSVD) (Chen and Dolan 2011) and Montreal Video Annotation Dataset (M-VAD) (Torabi et al. 2015).
Dataset Splits	Yes	In our experiments, we follow the setting used in prior works (Guadarrama et al. 2013; Pan et al. 2016a), taking 1,200 videos for training, 100 for validation and 670 for testing.
Hardware Specification	No	The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies	No	The paper mentions models and architectures like VGG and C3D but does not specify software dependencies (e.g., libraries, frameworks) with version numbers.
Experiment Setup	Yes	In the experiment, we compare our LSTM-LS approach with one 2-D CNN of 19-layer VGG (Simonyan and Zisserman 2015) network pre-trained on Imagenet ILSVRC12 dataset (Russakovsky et al. 2015), and one 3-D CNN of C3D (Tran et al. 2015) pre-trained on Sports1M video dataset (Karpathy et al. 2014). Specifically, we take the output of 4096-way fc6 layer from the 19-layer VGG and 4096-way fc6 layer from C3D as the frame and clip representation, respectively. The size of hidden layer in LSTM is set to 1,024. The number of nearest sentences K is empirically set to 4.
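
For anyone attempting a reproduction, the split sizes and hyperparameters quoted in the Dataset Splits and Experiment Setup rows can be gathered into one configuration object. The following is only a minimal sketch: the class name `LstmLsSetup`, its field names, and the contiguous index-based split are hypothetical and do not come from the paper, which releases no code. Only the numeric values are taken from the rows above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LstmLsSetup:
    """Hypothetical container for the settings quoted in the table above."""

    frame_feature_dim: int = 4096   # fc6 output of the 19-layer VGG (frame representation)
    clip_feature_dim: int = 4096    # fc6 output of C3D (clip representation)
    lstm_hidden_size: int = 1024    # size of the LSTM hidden layer
    num_nearest_sentences: int = 4  # K, set empirically according to the quoted text
    train_videos: int = 1200
    val_videos: int = 100
    test_videos: int = 670

    def total_videos(self) -> int:
        """Number of videos covered by the three splits."""
        return self.train_videos + self.val_videos + self.test_videos

    def split_indices(self):
        """Return (train, val, test) index ranges.

        Assumption: videos are split contiguously in a fixed order. The
        quoted text gives only the split sizes, not the ordering.
        """
        train_end = self.train_videos
        val_end = train_end + self.val_videos
        test_end = val_end + self.test_videos
        return range(0, train_end), range(train_end, val_end), range(val_end, test_end)


setup = LstmLsSetup()
train_idx, val_idx, test_idx = setup.split_indices()
print(setup.total_videos(), len(train_idx), len(val_idx), len(test_idx))
```

The three split sizes sum to 1,970 videos. Hardware, framework versions, and optimizer settings are not reported in the paper, so they are deliberately left out of the sketch.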