Semantic Grouping Network for Video Captioning
Authors: Hobin Ryu, Sunghun Kang, Haeyong Kang, Chang D. Yoo
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The SGN achieves state-of-the-art performance, outperforming runner-up methods by margins of 2.1%p and 2.4%p in CIDEr-D score on the MSVD and MSR-VTT datasets, respectively. Extensive experiments demonstrate the effectiveness and interpretability of the SGN. |
| Researcher Affiliation | Academia | Hobin Ryu, Sunghun Kang, Haeyong Kang, and Chang D. Yoo, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea {hobincar, sunghun.kang, haeyong.kang, cd_yoo}@kaist.ac.kr |
| Pseudocode | Yes | Algorithm 1: Phrase Suppression (an illustrative suppression sketch follows the table). |
| Open Source Code | Yes | https://github.com/hobincar/SGN |
| Open Datasets | Yes | MSVD. Microsoft Video Description (MSVD) dataset (Chen and Dolan 2011)... For a fair comparison, the dataset is divided into a training set of 1200 videos, a validation set of 100 videos, and a test set of 670 videos by following the official split (Yao et al. 2015). MSR-VTT. MSR Video-to-Text (MSR-VTT) dataset (Xu et al. 2016)... Following Xu et al. (Xu et al. 2016), the dataset is divided into a training set of 6513 videos, a validation set of 497 videos, and a test set of 2990 videos. |
| Dataset Splits | Yes | MSVD. Microsoft Video Description (MSVD) dataset (Chen and Dolan 2011)... For a fair comparison, the dataset is divided into a training set of 1200 videos, a validation set of 100 videos, and a test set of 670 videos by following the official split (Yao et al. 2015). MSR-VTT. MSR Video-to-Text (MSR-VTT) dataset (Xu et al. 2016)... Following Xu et al. (Xu et al. 2016), the dataset is divided into a training set of 6513 videos, a validation set of 497 videos, and a test set of 2990 videos. (The split counts are collected in a small configuration snippet after the table.) |
| Hardware Specification | Yes | On a single Titan V GPU with 12GB of memory, we measured the inference speed of two methods, SGN and TA (Yao et al. 2015) (see Table 4). (A timing harness sketch follows the table.) |
| Software Dependencies | No | The paper mentions 'GloVe' but does not provide specific version numbers for software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | Implementation Details. We uniformly sample N = 30 frames and clips from each video. As video captioning performances depend on backbone CNNs, various pre-trained CNNs including GoogLeNet (Szegedy et al. 2015), VGGNet (Simonyan and Zisserman 2015), ResNet (He et al. 2016), and 3D-ResNeXt (Hara, Kataoka, and Satoh 2018) are employed as a Visual Encoder to fairly compare SGN with state-of-the-art methods. The word embedding matrix is initialized using GloVe (Pennington, Socher, and Manning 2014) and jointly trained with the whole architecture. Before the first word (w1) is generated, <SOS> is used as the partially decoded caption (i.e., w0 = <SOS>) and then ignored thereafter. τ and λ are set to 0.2 and 0.16 as a result of 5-fold cross-validation over the values [0.1, 0.2, 0.3] and [0.01, 0.04, 0.16, 0.64], respectively. Beam search with a size of 5 is used for generating the final captions. BLEU@4 (Papineni et al. 2002), CIDEr-D (Vedantam, Lawrence Zitnick, and Parikh 2015), METEOR (Banerjee and Lavie 2005), and ROUGE-L (Lin 2004) are used for evaluation, and the scores are computed using the official codes from the Microsoft COCO evaluation server (Chen et al. 2015). (The sampling routine and these hyperparameters are summarized in a snippet after the table.) |
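
The paper's Algorithm 1 (Phrase Suppression) is only named in the Pseudocode row, not reproduced here. The snippet below is a minimal sketch of a generic threshold-based suppression step, assuming a candidate phrase is discarded when its cosine similarity to an already-kept phrase exceeds the threshold τ = 0.2 quoted in the implementation details; the function name, greedy ordering, and similarity criterion are illustrative assumptions, not the authors' exact procedure.

```python
import torch
import torch.nn.functional as F
from typing import List


def suppress_phrases(phrase_embs: torch.Tensor, tau: float = 0.2) -> List[int]:
    """Greedy threshold-based suppression (illustrative sketch, not the paper's exact Algorithm 1).

    phrase_embs: (P, D) tensor of candidate phrase embeddings.
    tau:         similarity threshold; a candidate is dropped when it is too
                 similar to any phrase that has already been kept.
    Returns the indices of the surviving phrases.
    """
    normed = F.normalize(phrase_embs, dim=-1)  # unit-norm rows, so dot product = cosine similarity
    kept: List[int] = []
    for i in range(normed.size(0)):
        if all(torch.dot(normed[i], normed[j]).item() < tau for j in kept):
            kept.append(i)
    return kept
```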
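
For quick reference, the official split sizes quoted in the Open Datasets and Dataset Splits rows can be captured in a small configuration dictionary; the constant name below is illustrative and not taken from the released code.

```python
# Official split sizes (video counts) quoted in the paper.
DATASET_SPLITS = {
    "MSVD":    {"train": 1200, "val": 100, "test": 670},   # split of Yao et al. (2015)
    "MSR-VTT": {"train": 6513, "val": 497, "test": 2990},  # split of Xu et al. (2016)
}
```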
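
The inference-speed comparison is reported on a single Titan V (12 GB). A common way to obtain such a measurement with PyTorch is sketched below; the `model.generate(feats)` interface is a hypothetical stand-in for whatever caption-generation call a given implementation exposes.

```python
import time
import torch


def measure_inference_speed(model, video_feats, device="cuda", warmup=5):
    """Rough wall-clock throughput of caption generation (illustrative harness only).

    model:       a captioning model exposing a hypothetical `generate(feats)` method.
    video_feats: a list of pre-extracted feature tensors, one per video.
    Returns videos processed per second.
    """
    model.eval().to(device)
    with torch.no_grad():
        for feats in video_feats[:warmup]:   # warm-up so kernel launches do not skew the timing
            model.generate(feats.to(device))
        torch.cuda.synchronize()
        start = time.perf_counter()
        for feats in video_feats:
            model.generate(feats.to(device))
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return len(video_feats) / elapsed
```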
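
The uniform sampling of N = 30 frames/clips and the reported hyperparameter values from the Experiment Setup row can be summarized as follows; the helper function is a minimal sketch, assuming frame indices are simply spaced evenly over the video, and is not the authors' released preprocessing code.

```python
import numpy as np

# Hyperparameters quoted in the paper's implementation details.
N_FRAMES  = 30    # frames/clips uniformly sampled per video
TAU       = 0.2   # chosen by 5-fold cross-validation over [0.1, 0.2, 0.3]
LAMBDA    = 0.16  # chosen over [0.01, 0.04, 0.16, 0.64]
BEAM_SIZE = 5     # beam width used when generating the final captions


def uniform_sample_indices(num_frames: int, n: int = N_FRAMES) -> np.ndarray:
    """Return n frame indices spaced evenly over a video with num_frames frames."""
    return np.linspace(0, num_frames - 1, num=n).round().astype(int)
```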