Set Prediction Guided by Semantic Concepts for Diverse Video Captioning
Authors: Yifan Lu, Ziqi Zhang, Chunfeng Yuan, Peng Li, Yan Wang, Bing Li, Weiming Hu
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on benchmark datasets show that the proposed SCG-SP achieves state-of-the-art (SOTA) performance under both relevance and diversity metrics. |
| Researcher Affiliation | Collaboration | (1) State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; (2) School of Artificial Intelligence, University of Chinese Academy of Sciences; (3) Alibaba Group; (4) Zhejiang Linkheer Science and Technology Co., Ltd.; (5) School of Information Science and Technology, ShanghaiTech University |
| Pseudocode | No | The paper describes the model architecture and process in detail but does not include a formal pseudocode block or algorithm. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the code for the described methodology, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | MSVD contains 1,970 videos from YouTube. Each video is annotated with 41 captions on average. We follow the split of 1200/100/670 for training, validation, and test. MSRVTT contains 10,000 open-domain videos. Each video is annotated with 20 captions. We follow the split of 6513/497/2990 for training, validation, and test. VATEX contains 34,991 videos, each with 10 English captions. We follow the split of 25991/3000/6000 for training, validation, and test. |
| Dataset Splits | Yes | We follow the split of 1200/100/670 for training, validation, and test [for MSVD]. We follow the split of 6513/497/2990 for training, validation, and test [for MSRVTT]. We follow the split of 25991/3000/6000 for training, validation, and test [for VATEX]. |
| Hardware Specification | Yes | The model is implemented with PyTorch, and all the experiments are conducted on a single RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions PyTorch and GPT-2 but does not specify their version numbers or the version numbers of any other key software dependencies. |
| Experiment Setup | Yes | The prefix length is set to 10. The weights of loss terms are set as λ = 1 and λ_d = 0.5. We apply AdamW as the optimizer. The learning rate and batch size are set to 8e-5 and 32 for SCG-SP-LSTM, and 1e-5 and 8 for SCG-SP-Prefix. We use beam search with size 3 for generation at the inference stage. |
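The reported hyperparameters can be collected into a small configuration sketch for reproduction. The variant names (SCG-SP-LSTM, SCG-SP-Prefix) and values come from the paper; the `training_config` helper itself is illustrative and not part of the authors' (unreleased) code:

```python
# Per-variant settings reported in the paper.
CONFIGS = {
    "SCG-SP-LSTM":   {"optimizer": "AdamW", "lr": 8e-5, "batch_size": 32},
    "SCG-SP-Prefix": {"optimizer": "AdamW", "lr": 1e-5, "batch_size": 8},
}

# Settings shared by both decoder variants.
SHARED = {
    "prefix_length": 10,  # prompt prefix length
    "lambda": 1.0,        # weight of the main loss term (λ)
    "lambda_d": 0.5,      # weight of the diversity loss term (λ_d)
    "beam_size": 3,       # beam search width at inference
}

def training_config(variant: str) -> dict:
    """Merge the shared settings with the variant-specific ones."""
    return {**SHARED, **CONFIGS[variant]}
```

For example, `training_config("SCG-SP-Prefix")` yields the learning rate 1e-5 and batch size 8 alongside the shared prefix length, loss weights, and beam size.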