Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Dual Video Summarization: From Frames to Captions
Authors: Zhenzhen Hu, Zhenshan Wang, Zijie Song, Richang Hong
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiment results on the MSR-VTT and MSVD dataset reveal that, for the generative task as video captioning, a small number of keyframes can convey the same semantic information to perform well on captioning, or even better than the original sampling. |
| Researcher Affiliation | Academia | Zhenzhen Hu (1,2), Zhenshan Wang (1), Zijie Song (1) and Richang Hong (1); (1) Hefei University of Technology; (2) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center |
| Pseudocode | No | The paper describes its framework and process in text and diagrams but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We evaluate our model on MSR-VTT [Xu et al., 2016] and MSVD [Chen and Dolan, 2011] datasets. |
| Dataset Splits | Yes | [MSR-VTT] We split the data into a 6,513 training set, 497 validation set and 2,990 testing set. [MSVD] We follow the data split of 1,200 videos for training, 100 videos for validation and the rest for testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions the use of "Adam optimizer" and "pre-trained CLIP [Radford et al., 2021] with 12 layers ViT-B/32" but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | Our summarizer module is trained with 10 epochs on the above datasets with learning rate 1e-3 and dropout 0.2. Our captioning module is trained with learning rate 1e-4 and 40 epochs, and we set the batch size to 32. Both the summarizer and captioning decoder employ Adam optimizer [Kingma and Ba, 2014] to minimize the loss. |
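
For anyone attempting a reproduction, the reported setup and split sizes can be collected in one place. The following is a minimal sketch, not the authors' code: the class and field names (`ModuleConfig`, `SplitSizes`, and so on) are hypothetical, the pairing of each split with MSR-VTT or MSVD is assumed from the order in which the datasets are named, and values the table does not state are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModuleConfig:
    """Training settings for one module (hypothetical container)."""
    epochs: int
    learning_rate: float
    optimizer: str = "adam"            # both modules are reported to use Adam
    dropout: Optional[float] = None    # None means "not reported"
    batch_size: Optional[int] = None   # None means "not reported"


@dataclass(frozen=True)
class SplitSizes:
    """Number of videos per split (hypothetical container)."""
    train: int
    val: int
    test: Optional[int] = None         # None means "the rest", count not stated


# Values copied from the Experiment Setup row.
SUMMARIZER = ModuleConfig(epochs=10, learning_rate=1e-3, dropout=0.2)
# The batch size of 32 is stated alongside the captioning module; whether it
# also applies to the summarizer is not explicit, so it is set only here.
CAPTIONER = ModuleConfig(epochs=40, learning_rate=1e-4, batch_size=32)

# Values copied from the Dataset Splits row; dataset pairing is an assumption.
MSRVTT_SPLIT = SplitSizes(train=6513, val=497, test=2990)
MSVD_SPLIT = SplitSizes(train=1200, val=100)


def describe(name: str, cfg: ModuleConfig) -> str:
    """Render one module's settings as a single readable line."""
    return (
        f"{name}: epochs={cfg.epochs}, lr={cfg.learning_rate:g}, "
        f"optimizer={cfg.optimizer}, dropout={cfg.dropout}, "
        f"batch_size={cfg.batch_size}"
    )


if __name__ == "__main__":
    print(describe("summarizer", SUMMARIZER))
    print(describe("captioner", CAPTIONER))
```

Keeping unreported values as `None` makes the gaps flagged in the table (hardware, software versions, the exact MSVD test count) visible instead of silently filled with defaults.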