Semantic Grouping Network for Video Captioning

Authors: Hobin Ryu, Sunghun Kang, Haeyong Kang, Chang D. Yoo

Venue: AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The SGN achieves state-of-the-art performance, outperforming runner-up methods by margins of 2.1%p and 2.4%p in CIDEr-D score on the MSVD and MSR-VTT datasets, respectively. Extensive experiments demonstrate the effectiveness and interpretability of the SGN.
Researcher Affiliation | Academia | Hobin Ryu, Sunghun Kang, Haeyong Kang, and Chang D. Yoo, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea. {hobincar, sunghun.kang, haeyong.kang, cd_yoo}@kaist.ac.kr
Pseudocode | Yes (see the suppression sketch after this table) | Algorithm 1: Phrase Suppression.
Open Source Code | Yes | https://github.com/hobincar/SGN
Open Datasets | Yes | MSVD. Microsoft Video Description (MSVD) dataset (Chen and Dolan 2011)... For a fair comparison, the dataset is divided into a training set of 1200 videos, a validation set of 100 videos, and a test set of 670 videos by following the official split (Yao et al. 2015). MSR-VTT. MSR Video-to-Text (MSR-VTT) dataset (Xu et al. 2016)... Following Xu et al. (Xu et al. 2016), the dataset is divided into a training set of 6513 videos, a validation set of 497 videos, and a test set of 2990 videos.
Dataset Splits | Yes (see the split sketch after this table) | MSVD. Microsoft Video Description (MSVD) dataset (Chen and Dolan 2011)... For a fair comparison, the dataset is divided into a training set of 1200 videos, a validation set of 100 videos, and a test set of 670 videos by following the official split (Yao et al. 2015). MSR-VTT. MSR Video-to-Text (MSR-VTT) dataset (Xu et al. 2016)... Following Xu et al. (Xu et al. 2016), the dataset is divided into a training set of 6513 videos, a validation set of 497 videos, and a test set of 2990 videos.
Hardware Specification | Yes (see the timing sketch after this table) | On a single Titan V GPU with 12GB of memory, we measured the inference speed of two methods, SGN and TA (Yao et al. 2015) (see Table 4).
Software Dependencies | No | The paper mentions 'GloVe' but does not provide specific version numbers for software dependencies or libraries used in the experiment.
Experiment Setup | Yes (see the sampling and evaluation sketches after this table) | Implementation Details. We uniformly sample N = 30 frames and clips from each video. As video captioning performance depends on the backbone CNN, various pre-trained CNNs including GoogLeNet (Szegedy et al. 2015), VGGNet (Simonyan and Zisserman 2015), ResNet (He et al. 2016), and 3D-ResNeXt (Hara, Kataoka, and Satoh 2018) are employed as the Visual Encoder to fairly compare SGN with state-of-the-art methods. The word embedding matrix is initialized using GloVe (Pennington, Socher, and Manning 2014) and jointly trained with the whole architecture. Before the first word (w_1) is generated, <SOS> is used as the partially decoded caption (i.e., w_0 = <SOS>) and then ignored thereafter. τ and λ are set to 0.2 and 0.16 as a result of 5-fold cross-validation over the values [0.1, 0.2, 0.3] and [0.01, 0.04, 0.16, 0.64], respectively. Beam search with a size of 5 is used for generating the final captions. BLEU@4 (Papineni et al. 2002), CIDEr-D (Vedantam, Lawrence Zitnick, and Parikh 2015), METEOR (Banerjee and Lavie 2005), and ROUGE-L (Lin 2004) are used for evaluation, and the scores are computed using the official code from the Microsoft COCO evaluation server (Chen et al. 2015).
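
Regarding the Pseudocode row: the paper's Algorithm 1 is named "Phrase Suppression". The sketch below is only a generic, non-maximum-suppression-style reading of that idea, not the paper's exact procedure; the relevance scores, the pairwise similarity function, and the greedy ordering are all assumptions, with τ = 0.2 borrowed from the setup row.

```python
# Hypothetical NMS-style phrase suppression: greedily keep high-scoring
# phrases and drop any phrase too similar to one already kept.
from typing import Callable, List

def phrase_suppression(
    scores: List[float],
    sim: Callable[[int, int], float],  # pairwise phrase similarity in [0, 1] (assumed)
    tau: float = 0.2,                  # suppression threshold (paper sets tau = 0.2)
) -> List[int]:
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(sim(i, j) <= tau for j in kept):
            kept.append(i)
    return kept

# Toy usage: phrases 0 and 1 overlap heavily, so only the higher-scoring one survives.
print(phrase_suppression([0.9, 0.8, 0.3], lambda i, j: 0.5 if {i, j} == {0, 1} else 0.1))
# -> [0, 2]
```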
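
For the Open Datasets / Dataset Splits rows, both corpora use fixed contiguous splits (MSVD: 1200/100/670; MSR-VTT: 6513/497/2990). A minimal sketch, assuming videos are ordered by the conventional indices of the official split files:

```python
# Build the standard contiguous train/val/test index ranges quoted above.
def contiguous_splits(n_train: int, n_val: int, n_test: int):
    train = list(range(n_train))
    val = list(range(n_train, n_train + n_val))
    test = list(range(n_train + n_val, n_train + n_val + n_test))
    return train, val, test

msvd_train, msvd_val, msvd_test = contiguous_splits(1200, 100, 670)  # 1970 videos total
vtt_train, vtt_val, vtt_test = contiguous_splits(6513, 497, 2990)    # 10000 videos total
assert len(msvd_test) == 670 and vtt_test[0] == 7010
```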
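
For the Hardware Specification row, measuring inference speed on a single GPU typically requires explicit synchronization so the timer counts completed GPU work. A rough PyTorch sketch; `model` and `batches` are hypothetical placeholders, not names from the released code:

```python
import time
import torch

@torch.no_grad()
def videos_per_second(model, batches, device: str = "cuda") -> float:
    """Rough single-GPU inference-speed benchmark (sketch)."""
    model.eval().to(device)
    batches = [b.to(device) for b in batches]
    model(batches[0])                  # warm-up pass (CUDA init, autotuning)
    torch.cuda.synchronize()
    start, n_videos = time.perf_counter(), 0
    for feats in batches[1:]:
        model(feats)                   # caption-generation forward pass
        n_videos += feats.size(0)
    torch.cuda.synchronize()           # wait for queued kernels before stopping the clock
    return n_videos / (time.perf_counter() - start)
```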
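
For the Experiment Setup row, "uniformly sample N = 30 frames and clips" is commonly implemented by spacing indices evenly over the video; the rounding scheme below is an assumption, as the quoted text does not spell it out:

```python
import numpy as np

def uniform_sample_indices(num_frames: int, n: int = 30) -> np.ndarray:
    """Pick n frame indices spread evenly across [0, num_frames)."""
    return np.linspace(0, num_frames - 1, num=n).round().astype(int)

print(uniform_sample_indices(300))  # [  0  10  21 ... 299]
```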
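
Finally, the four metrics are computed with the official Microsoft COCO evaluation code. Assuming its common pycocoevalcap packaging (and glossing over the PTBTokenizer preprocessing step that the server normally applies), scoring looks roughly like:

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor  # requires a Java runtime
from pycocoevalcap.rouge.rouge import Rouge

def evaluate(gts: dict, res: dict) -> dict:
    """gts: {video_id: [reference captions]}, res: {video_id: [one generated caption]}."""
    bleu, _ = Bleu(4).compute_score(gts, res)       # bleu is [BLEU@1..BLEU@4]
    cider, _ = Cider().compute_score(gts, res)
    meteor, _ = Meteor().compute_score(gts, res)
    rouge, _ = Rouge().compute_score(gts, res)
    return {"BLEU@4": bleu[3], "CIDEr-D": cider, "METEOR": meteor, "ROUGE-L": rouge}
```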