Dual Video Summarization: From Frames to Captions
Authors: Zhenzhen Hu, Zhenshan Wang, Zijie Song, Richang Hong
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiment results on the MSR-VTT and MSVD datasets reveal that, for a generative task such as video captioning, a small number of keyframes can convey the same semantic information and perform as well on captioning as the original sampling, or even better. |
| Researcher Affiliation | Academia | Zhenzhen Hu (1,2), Zhenshan Wang (1), Zijie Song (1) and Richang Hong (1); (1) Hefei University of Technology; (2) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center |
| Pseudocode | No | The paper describes its framework and process in text and diagrams but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We evaluate our model on MSR-VTT [Xu et al., 2016] and MSVD [Chen and Dolan, 2011] datasets. |
| Dataset Splits | Yes | For MSR-VTT: "We split the data into a 6,513 training set, 497 validation set and 2,990 testing set." For MSVD: "We follow the data split of 1,200 videos for training, 100 videos for validation and the rest for testing." (Both splits are encoded in the first sketch after the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions the use of the "Adam optimizer" and "pre-trained CLIP [Radford et al., 2021] with 12 layers ViT-B/32" but does not provide version numbers for any software dependencies or libraries (see the CLIP sketch after the table). |
| Experiment Setup | Yes | Our summarizer module is trained for 10 epochs on the above datasets with a learning rate of 1e-3 and dropout of 0.2. Our captioning module is trained with a learning rate of 1e-4 for 40 epochs, and we set the batch size to 32. Both the summarizer and the captioning decoder employ the Adam optimizer [Kingma and Ba, 2014] to minimize the loss (see the training sketch after the table). |
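For convenience, the two reported splits can be written down as a small configuration. This is a minimal sketch; the dictionary layout and variable names are illustrative, not from the paper, and the 670-video MSVD test count follows from "the rest" of its 1,970 clips.

```python
# Dataset splits as reported in the paper (counts are video-level).
# The dictionary structure itself is illustrative, not from the paper.
DATASET_SPLITS = {
    "MSR-VTT": {"train": 6513, "val": 497, "test": 2990},
    "MSVD": {"train": 1200, "val": 100, "test": 670},  # "the rest" of 1,970 videos
}

for name, split in DATASET_SPLITS.items():
    total = sum(split.values())
    print(f"{name}: {split} (total {total} videos)")
```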
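The paper names pre-trained CLIP with a 12-layer ViT-B/32 image encoder but releases no code. Below is a minimal sketch of extracting per-frame features with OpenAI's `clip` package; the frame paths are hypothetical and the frame-sampling strategy is an assumption, not the authors' pipeline.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# ViT-B/32: the 12-layer vision transformer backbone named in the paper.
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()

# Hypothetical list of sampled frame paths; adapt to your own extraction step.
frame_paths = ["frames/frame_000.jpg", "frames/frame_001.jpg"]

with torch.no_grad():
    frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    features = model.encode_image(frames)  # shape (num_frames, 512) for ViT-B/32

print(features.shape)
```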
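The reported hyperparameters map onto a standard PyTorch training setup. The sketch below encodes them; `Summarizer` and `CaptionDecoder` are hypothetical placeholders for the paper's unreleased modules, and only the optimizer choice, learning rates, dropout, epoch counts, and batch size come from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the paper's unreleased summarizer and captioning decoder.
class Summarizer(nn.Module):
    def __init__(self, feat_dim=512, dropout=0.2):  # dropout 0.2 as reported
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(feat_dim, 1),
        )

    def forward(self, frame_feats):                   # (batch, num_frames, feat_dim)
        return self.scorer(frame_feats).squeeze(-1)   # per-frame importance scores

class CaptionDecoder(nn.Module):
    def __init__(self, feat_dim=512, vocab_size=10000):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.out = nn.Linear(feat_dim, vocab_size)

    def forward(self, keyframe_feats):                # (batch, num_keyframes, feat_dim)
        hidden, _ = self.rnn(keyframe_feats)
        return self.out(hidden)                       # per-step vocabulary logits

summarizer, captioner = Summarizer(), CaptionDecoder()

# Hyperparameters exactly as reported in the paper.
SUMMARIZER_EPOCHS, CAPTIONER_EPOCHS, BATCH_SIZE = 10, 40, 32
summarizer_opt = torch.optim.Adam(summarizer.parameters(), lr=1e-3)
captioner_opt = torch.optim.Adam(captioner.parameters(), lr=1e-4)
```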