Video Captioning with Listwise Supervision
Authors: Yuan Liu, Xue Li, Zhongchao Shi
AAAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments on MSVD dataset show that our proposed LSTM-LS produces better performance than the state of the art in generating natural sentences: 51.1% and 32.6% in terms of BLEU@4 and METEOR, respectively. Superior performances are also reported on the movie description M-VAD dataset. |
| Researcher Affiliation | Industry | Yuan Liu, Xue Li, Zhongchao Shi Ricoh Software Research Center (Beijing) Co., Ltd., Beijing, China Ricoh Company, Ltd., Yokohama, Japan yuanliu.ustc@gmail.com, Setsu.Ri@nts.ricoh.co.jp, Zhongchao.Shi@srcb.ricoh.com |
| Pseudocode | No | The paper describes the model's equations and procedures but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about open-sourcing the code or links to a code repository. |
| Open Datasets | Yes | We evaluate and compare our proposed LSTM-LS with state-of-the-art approaches by conducting video captioning on two benchmarks, i.e., Microsoft Research Video Description Corpus (MSVD) (Chen and Dolan 2011) and Montreal Video Annotation Dataset (M-VAD) (Torabi et al. 2015). |
| Dataset Splits | Yes | In our experiments, we follow the setting used in prior works (Guadarrama et al. 2013; Pan et al. 2016a), taking 1,200 videos for training, 100 for validation and 670 for testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions models and architectures like VGG and C3D but does not specify software dependencies (e.g., libraries, frameworks) with version numbers. |
| Experiment Setup | Yes | In the experiment, we compare our LSTM-LS approach with one 2-D CNN of 19-layer VGG (Simonyan and Zisserman 2015) network pre-trained on Imagenet ILSVRC12 dataset (Russakovsky et al. 2015), and one 3-D CNN of C3D (Tran et al. 2015) pre-trained on Sports1M video dataset (Karpathy et al. 2014). Specifically, we take the output of 4096-way fc6 layer from the 19-layer VGG and 4096-way fc6 layer from C3D as the frame and clip representation, respectively. The size of hidden layer in LSTM is set to 1,024. The number of nearest sentences K is empirically set to 4. |
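
The "Dataset Splits" row above quotes the standard MSVD protocol of 1,200 training, 100 validation, and 670 test videos. The following is a minimal sketch of that split, assuming the common convention of contiguous indices over MSVD's 1,970 clips; the paper itself only states the split sizes, so the index ranges here are an assumption.

```python
# Illustrative sketch of the MSVD split quoted in the "Dataset Splits" row
# (1,200 train / 100 validation / 670 test). The contiguous index ranges and
# the total of 1,970 clips follow the common MSVD convention and are
# assumptions, not details taken from the paper.
NUM_VIDEOS = 1970

train_ids = list(range(0, 1200))           # 1,200 training videos
val_ids   = list(range(1200, 1300))        # 100 validation videos
test_ids  = list(range(1300, NUM_VIDEOS))  # 670 test videos

assert len(train_ids) == 1200 and len(val_ids) == 100 and len(test_ids) == 670
```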
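
The "Experiment Setup" row reports 4096-d VGG-19 fc6 frame features, 4096-d C3D fc6 clip features, an LSTM hidden size of 1,024, and K = 4 nearest sentences. The sketch below wires those dimensions into a generic LSTM caption decoder so the reported hyperparameters are concrete; it is not the authors' LSTM-LS model (the listwise supervision over the K nearest sentences is omitted), and the class, variable names, and PyTorch choice are assumptions for illustration only.

```python
# Minimal sketch of the configuration quoted in the "Experiment Setup" row.
# NOT the authors' LSTM-LS: the listwise ranking objective over K = 4 nearest
# sentences is omitted; structure and names are illustrative assumptions.
import torch
import torch.nn as nn

VGG_FC6_DIM = 4096   # 19-layer VGG, fc6 output (frame representation)
C3D_FC6_DIM = 4096   # C3D, fc6 output (clip representation)
HIDDEN_SIZE = 1024   # LSTM hidden layer size reported in the paper
K_NEAREST   = 4      # number of nearest sentences K (used by LSTM-LS, unused here)

class CaptionDecoderSketch(nn.Module):
    """Generic LSTM caption decoder conditioned on mean-pooled video features."""
    def __init__(self, vocab_size, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(VGG_FC6_DIM + C3D_FC6_DIM, HIDDEN_SIZE)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, HIDDEN_SIZE, batch_first=True)
        self.out = nn.Linear(HIDDEN_SIZE, vocab_size)

    def forward(self, vgg_feat, c3d_feat, captions):
        # vgg_feat, c3d_feat: (batch, 4096) mean-pooled features
        # captions: (batch, seq_len) word indices
        video = torch.tanh(self.video_proj(torch.cat([vgg_feat, c3d_feat], dim=1)))
        h0 = video.unsqueeze(0)        # (1, batch, hidden) initial hidden state
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)     # (batch, seq_len, embed_dim)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)        # (batch, seq_len, vocab) word logits

# Usage sketch with random tensors standing in for real features and captions.
model = CaptionDecoderSketch(vocab_size=10000)
logits = model(torch.randn(2, VGG_FC6_DIM), torch.randn(2, C3D_FC6_DIM),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```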