GL-RG: Global-Local Representation Granularity for Video Captioning

Authors: Liqi Yan, Qifan Wang, Yiming Cui, Fuli Feng, Xiaojun Quan, Xiangyu Zhang, Dongfang Liu

IJCAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on the challenging MSR-VTT and MSVD datasets show that our GL-RG outperforms recent state-of-the-art methods by a significant margin.
Researcher Affiliation | Collaboration | Liqi Yan (Fudan University, Westlake University, Rochester Institute of Technology), Qifan Wang (Meta AI), Yiming Cui (University of Florida), Fuli Feng (University of Science and Technology of China), Xiaojun Quan (Sun Yat-sen University), Xiangyu Zhang (Purdue University), Dongfang Liu (Rochester Institute of Technology)
Pseudocode | No | The paper includes equations and architectural diagrams, but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code is available at https://github.com/ylqi/GL-RG.
Open Datasets | Yes | We evaluate our GL-RG on the MSR-VTT dataset [Xu et al., 2016]. We also evaluate our GL-RG on the MSVD dataset [Chen and Dolan, 2011].
Dataset Splits | Yes | For MSR-VTT, we follow the data split of 6,513 videos for training, 497 for validation, and 2,990 for testing. For MSVD, we split the dataset into a 1,200-video training set, a 100-video validation set, and a 670-video testing set by contiguous index number (see the split sketch below).
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU/CPU models, memory, or processing units.
Software Dependencies | No | The paper does not specify version numbers for any software dependencies or libraries used in the implementation; it only mentions the use of models pre-trained on certain datasets.
Experiment Setup | Yes | Our decoder is trained with a learning rate of 0.0003 in the seeding phase and 0.0001 in the boosting phase. For each video, training uses 20 ground-truth captions on MSR-VTT and 17 on MSVD (see the training sketch below).