Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Video Captioning with Listwise Supervision
Authors: Yuan Liu, Xue Li, Zhongchao Shi
AAAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments on MSVD dataset show that our proposed LSTM-LS produces better performance than the state of the art in generating natural sentences: 51.1% and 32.6% in terms of BLEU@4 and METEOR, respectively. Superior performances are also reported on the movie description M-VAD dataset. |
| Researcher Affiliation | Industry | Yuan Liu, Xue Li, Zhongchao Shi; Ricoh Software Research Center (Beijing) Co., Ltd., Beijing, China; Ricoh Company, Ltd., Yokohama, Japan |
| Pseudocode | No | The paper describes the model's equations and procedures but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about open-sourcing the code or links to a code repository. |
| Open Datasets | Yes | We evaluate and compare our proposed LSTM-LS with state-of-the-art approaches by conducting video captioning on two benchmarks, i.e., Microsoft Research Video Description Corpus (MSVD) (Chen and Dolan 2011) and Montreal Video Annotation Dataset (M-VAD) (Torabi et al. 2015). |
| Dataset Splits | Yes | In our experiments, we follow the setting used in prior works (Guadarrama et al. 2013; Pan et al. 2016a), taking 1,200 videos for training, 100 for validation and 670 for testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions models and architectures like VGG and C3D but does not specify software dependencies (e.g., libraries, frameworks) with version numbers. |
| Experiment Setup | Yes | In the experiment, we compare our LSTM-LS approach with one 2-D CNN of 19-layer VGG (Simonyan and Zisserman 2015) network pre-trained on ImageNet ILSVRC12 dataset (Russakovsky et al. 2015), and one 3-D CNN of C3D (Tran et al. 2015) pre-trained on Sports1M video dataset (Karpathy et al. 2014). Specifically, we take the output of 4096-way fc6 layer from the 19-layer VGG and 4096-way fc6 layer from C3D as the frame and clip representation, respectively. The size of hidden layer in LSTM is set to 1,024. The number of nearest sentences K is empirically set to 4. |
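
The values quoted in the Dataset Splits and Experiment Setup rows can be collected into a small configuration object, which may help anyone attempting a re-implementation. The sketch below is illustrative only: the class name `LSTMLSSetup`, the helper `split_videos`, and the placeholder video ids are hypothetical, and the assumption that the split follows the order of the video list (first 1,200 for training, next 100 for validation, last 670 for testing) is not stated in the rows above.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class LSTMLSSetup:
    """Values quoted in the table above; field names are hypothetical."""
    frame_feature_dim: int = 4096    # fc6 output of the 19-layer VGG
    clip_feature_dim: int = 4096     # fc6 output of C3D
    lstm_hidden_size: int = 1024     # size of the LSTM hidden layer
    num_nearest_sentences: int = 4   # K, set empirically in the paper
    n_train: int = 1200              # MSVD training videos
    n_val: int = 100                 # MSVD validation videos
    n_test: int = 670                # MSVD test videos


def split_videos(video_ids: Sequence[str], setup: LSTMLSSetup) -> Dict[str, List[str]]:
    """Partition ids into train/val/test by position (ordering is an assumption)."""
    expected = setup.n_train + setup.n_val + setup.n_test
    if len(video_ids) != expected:
        raise ValueError(f"expected {expected} videos, got {len(video_ids)}")
    val_start = setup.n_train
    test_start = val_start + setup.n_val
    return {
        "train": list(video_ids[:val_start]),
        "val": list(video_ids[val_start:test_start]),
        "test": list(video_ids[test_start:]),
    }


setup = LSTMLSSetup()
# Placeholder ids; the real MSVD identifiers are not given in this report.
video_ids = [f"vid{i:04d}" for i in range(1970)]
splits = split_videos(video_ids, setup)
print({name: len(ids) for name, ids in splits.items()})
```

Hardware, software versions, and optimizer settings are not reported, so they are deliberately left out of the sketch rather than guessed.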