Semantic Grouping Network for Video Captioning

Authors: Hobin Ryu, Sunghun Kang, Haeyong Kang, Chang D. Yoo

Venue: AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The SGN achieves state-of-the-art performance, outperforming runner-up methods by margins of 2.1%p and 2.4%p in CIDEr-D score on the MSVD and MSR-VTT datasets, respectively. Extensive experiments demonstrate the effectiveness and interpretability of the SGN.
Researcher Affiliation | Academia | Hobin Ryu, Sunghun Kang, Haeyong Kang, and Chang D. Yoo, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea. {hobincar, sunghun.kang, haeyong.kang, cd_yoo}@kaist.ac.kr
Pseudocode | Yes (see the suppression sketch after this table) | Algorithm 1: Phrase Suppression.
Open Source Code | Yes | https://github.com/hobincar/SGN
Open Datasets | Yes | MSVD. Microsoft Video Description (MSVD) dataset (Chen and Dolan 2011)... For a fair comparison, the dataset is divided into a training set of 1200 videos, a validation set of 100 videos, and a test set of 670 videos by following the official split (Yao et al. 2015). MSR-VTT. MSR Video-to-Text (MSR-VTT) dataset (Xu et al. 2016)... Following Xu et al. (Xu et al. 2016), the dataset is divided into a training set of 6513 videos, a validation set of 497 videos, and a test set of 2990 videos.
Dataset Splits | Yes (see the split sketch after this table) | MSVD. Microsoft Video Description (MSVD) dataset (Chen and Dolan 2011)... For a fair comparison, the dataset is divided into a training set of 1200 videos, a validation set of 100 videos, and a test set of 670 videos by following the official split (Yao et al. 2015). MSR-VTT. MSR Video-to-Text (MSR-VTT) dataset (Xu et al. 2016)... Following Xu et al. (Xu et al. 2016), the dataset is divided into a training set of 6513 videos, a validation set of 497 videos, and a test set of 2990 videos.
Hardware Specification | Yes (see the timing sketch after this table) | On a single Titan V GPU with 12GB of memory, we measured the inference speed of two methods, SGN and TA (Yao et al. 2015) (see Table 4).
Software Dependencies | No | The paper mentions 'GloVe' but does not provide specific version numbers for software dependencies or libraries used in the experiment.
Experiment Setup | Yes (see the sampling and evaluation sketches after this table) | Implementation Details. We uniformly sample N = 30 frames and clips from each video. As video captioning performance depends on the backbone CNN, various pre-trained CNNs including GoogLeNet (Szegedy et al. 2015), VGGNet (Simonyan and Zisserman 2015), ResNet (He et al. 2016), and 3D-ResNeXt (Hara, Kataoka, and Satoh 2018) are employed as the Visual Encoder to fairly compare SGN with state-of-the-art methods. The word embedding matrix is initialized using GloVe (Pennington, Socher, and Manning 2014) and jointly trained with the whole architecture. Before the first word (w_1) is generated, <SOS> is used as the partially decoded caption (i.e., w_0 = <SOS>) and then ignored thereafter. τ and λ are set to 0.2 and 0.16 as a result of 5-fold cross-validation over the values [0.1, 0.2, 0.3] and [0.01, 0.04, 0.16, 0.64], respectively. Beam search with a size of 5 is used for generating the final captions. BLEU@4 (Papineni et al. 2002), CIDEr-D (Vedantam, Lawrence Zitnick, and Parikh 2015), METEOR (Banerjee and Lavie 2005), and ROUGE-L (Lin 2004) are used for evaluation, and the scores are computed using the official code from the Microsoft COCO evaluation server (Chen et al. 2015).
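
Regarding the Pseudocode row: the paper's Algorithm 1 is named "Phrase Suppression". The sketch below is only a generic, non-maximum-suppression-style reading of that idea, not the paper's exact procedure; the relevance scores, the pairwise similarity function, and the greedy ordering are all assumptions, with τ = 0.2 borrowed from the setup row.

```python
# Hypothetical NMS-style phrase suppression: greedily keep high-scoring
# phrases and drop any phrase too similar to one already kept.
from typing import Callable, List

def phrase_suppression(
    scores: List[float],
    sim: Callable[[int, int], float],  # pairwise phrase similarity in [0, 1] (assumed)
    tau: float = 0.2,                  # suppression threshold (paper sets tau = 0.2)
) -> List[int]:
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(sim(i, j) <= tau for j in kept):
            kept.append(i)
    return kept

# Toy usage: phrases 0 and 1 overlap heavily, so only the higher-scoring one survives.
print(phrase_suppression([0.9, 0.8, 0.3], lambda i, j: 0.5 if {i, j} == {0, 1} else 0.1))
# -> [0, 2]
```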
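
For the Open Datasets / Dataset Splits rows, both corpora use fixed contiguous splits (MSVD: 1200/100/670; MSR-VTT: 6513/497/2990). A minimal sketch, assuming videos are ordered by the conventional indices of the official split files:

```python
# Build the standard contiguous train/val/test index ranges quoted above.
def contiguous_splits(n_train: int, n_val: int, n_test: int):
    train = list(range(n_train))
    val = list(range(n_train, n_train + n_val))
    test = list(range(n_train + n_val, n_train + n_val + n_test))
    return train, val, test

msvd_train, msvd_val, msvd_test = contiguous_splits(1200, 100, 670)  # 1970 videos total
vtt_train, vtt_val, vtt_test = contiguous_splits(6513, 497, 2990)    # 10000 videos total
assert len(msvd_test) == 670 and vtt_test[0] == 7010
```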
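
For the Hardware Specification row, measuring inference speed on a single GPU typically requires explicit synchronization so the timer counts completed GPU work. A rough PyTorch sketch; `model` and `batches` are hypothetical placeholders, not names from the released code:

```python
import time
import torch

@torch.no_grad()
def videos_per_second(model, batches, device: str = "cuda") -> float:
    """Rough single-GPU inference-speed benchmark (sketch)."""
    model.eval().to(device)
    batches = [b.to(device) for b in batches]
    model(batches[0])                  # warm-up pass (CUDA init, autotuning)
    torch.cuda.synchronize()
    start, n_videos = time.perf_counter(), 0
    for feats in batches[1:]:
        model(feats)                   # caption-generation forward pass
        n_videos += feats.size(0)
    torch.cuda.synchronize()           # wait for queued kernels before stopping the clock
    return n_videos / (time.perf_counter() - start)
```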
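
For the Experiment Setup row, "uniformly sample N = 30 frames and clips" is commonly implemented by spacing indices evenly over the video; the rounding scheme below is an assumption, as the quoted text does not spell it out:

```python
import numpy as np

def uniform_sample_indices(num_frames: int, n: int = 30) -> np.ndarray:
    """Pick n frame indices spread evenly across [0, num_frames)."""
    return np.linspace(0, num_frames - 1, num=n).round().astype(int)

print(uniform_sample_indices(300))  # [  0  10  21 ... 299]
```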
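
Finally, the four metrics are computed with the official Microsoft COCO evaluation code. Assuming its common pycocoevalcap packaging (and glossing over the PTBTokenizer preprocessing step that the server normally applies), scoring looks roughly like:

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor  # requires a Java runtime
from pycocoevalcap.rouge.rouge import Rouge

def evaluate(gts: dict, res: dict) -> dict:
    """gts: {video_id: [reference captions]}, res: {video_id: [one generated caption]}."""
    bleu, _ = Bleu(4).compute_score(gts, res)       # bleu is [BLEU@1..BLEU@4]
    cider, _ = Cider().compute_score(gts, res)
    meteor, _ = Meteor().compute_score(gts, res)
    rouge, _ = Rouge().compute_score(gts, res)
    return {"BLEU@4": bleu[3], "CIDEr-D": cider, "METEOR": meteor, "ROUGE-L": rouge}
```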