Variational Structured Semantic Inference for Diverse Image Captioning

Authors: Fuhai Chen, Rongrong Ji, Jiayi Ji, Xiaoshuai Sun, Baochang Zhang, Xuri Ge, Yongjian Wu, Feiyue Huang, Yan Wang

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the benchmark dataset show that the proposed VSSI-cap achieves significant improvements over the state-of-the-art methods. ... Section 4 (Experiments) ... Table 2: Performance comparisons on accuracy of diverse image captioning. ... Table 3: Performance comparisons on diversity.
Researcher Affiliation | Collaboration | Fuhai Chen1, Rongrong Ji1,2, Jiayi Ji1, Xiaoshuai Sun1, Baochang Zhang3, Xuri Ge1, Yongjian Wu4, Feiyue Huang4, Yan Wang5. Affiliations: 1Department of Artificial Intelligence, School of Informatics, Xiamen University; 2Peng Cheng Lab; 3Beihang University; 4Tencent Youtu Lab; 5Pinterest
Pseudocode | Yes | D_KL can be approximated following [28] (see algorithm flow in supplementary material). (A sketch of the standard closed-form Gaussian KL term appears after this table.)
Open Source Code | Yes | The sizes of the entity, relation, and POS vocabularies are 840, 248, and 4, respectively. [Footnote 8: https://github.com/cfh3c/NeurIPS19_VPtree_Dics]
Open Datasets | Yes | We conduct all the experiments on the MSCOCO dataset [30], which is widely used for image captioning [1, 3] and diverse image captioning [5, 8]. [Footnote 5: http://cocodataset.org/#download]
Dataset Splits | Yes | There are over 93K images in MSCOCO, which has been split into training, validation, and testing sets. [Footnote 6: https://github.com/karpathy/neuraltalk] ... We implement our model training based on the public code (footnote 9) with the standard data split and the separate z samples. (A sketch of loading the standard Karpathy split appears after this table.)
Hardware Specification | Yes | The overall process takes 37 hours on an NVIDIA GeForce GTX 1080 Ti GPU with 11 GB memory.
Software Dependencies | No | The paper mentions using the Stanford Parser [32], NLTK [33], and Python, but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | In the proposed Var MI-tree, we set the feature dimension of each node as 512. The dimensions of each mean, each standard deviation, and each latent variable are set as 150. ... We set the word vector dimension as 256 during word embedding. ... All networks are trained with SGD with a learning rate of 0.005 for the first 5 epochs; the learning rate is then halved every 5 epochs. On average, all models converge within 50 epochs. (A sketch of this learning-rate schedule appears after this table.)
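
For the Pseudocode entry: the paper only states that D_KL can be approximated following [28]. If [28] refers to the usual variational-autoencoder formulation (a closed-form KL divergence between a diagonal-Gaussian posterior and a standard-normal prior), the term looks like the sketch below. This is an assumption for illustration, not the authors' released implementation; the 150-dimensional latent size is taken from the Experiment Setup entry.

```python
import torch

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    This is the standard VAE regularizer; whether it matches the
    approximation of [28] exactly is an assumption.
    """
    # 0.5 * sum_d (mu_d^2 + sigma_d^2 - log sigma_d^2 - 1), averaged over the batch
    per_sample = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1.0, dim=-1)
    return per_sample.mean()

# Example with the paper's reported latent dimensionality of 150.
mu, log_var = torch.zeros(8, 150), torch.zeros(8, 150)
print(gaussian_kl(mu, log_var))  # 0.0 when the posterior equals the prior
```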
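
For the Dataset Splits entry: the standard split referenced via the neuraltalk footnote is conventionally distributed as a dataset_coco.json file whose image records carry a split field (train / val / test / restval). The loader below is a sketch under that assumption; the file name and field names are the community convention, not something the paper specifies.

```python
import json
from collections import defaultdict

def load_karpathy_splits(path="dataset_coco.json"):
    """Group MSCOCO images by the 'split' field of the Karpathy split file.

    Returns a dict mapping split name to a list of (filename, captions) pairs.
    """
    with open(path) as f:
        data = json.load(f)
    splits = defaultdict(list)
    for img in data["images"]:
        captions = [s["raw"] for s in img["sentences"]]
        splits[img["split"]].append((img["filename"], captions))
    return splits
```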
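
For the Experiment Setup entry: the reported schedule (SGD, initial learning rate 0.005, halved every 5 epochs, convergence within 50 epochs) maps directly onto a step decay. A minimal PyTorch sketch, with a placeholder model standing in for the actual captioning network:

```python
import torch

model = torch.nn.Linear(512, 256)  # placeholder; not the paper's captioning model
optimizer = torch.optim.SGD(model.parameters(), lr=0.005)
# Halve the learning rate every 5 epochs, matching the reported schedule.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

for epoch in range(50):  # models reportedly converge within 50 epochs
    # ... one training pass over the data would go here ...
    optimizer.step()   # placeholder; real code steps once per batch
    scheduler.step()   # decay once per epoch
```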