Variational Structured Semantic Inference for Diverse Image Captioning

Authors: Fuhai Chen, Rongrong Ji, Jiayi Ji, Xiaoshuai Sun, Baochang Zhang, Xuri Ge, Yongjian Wu, Feiyue Huang, Yan Wang

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the benchmark dataset show that the proposed VSSI-cap achieves significant improvements over the state-of-the-art methods. ... Section 4 (Experiments) ... Table 2: Performance comparisons on accuracy of diverse image captioning. ... Table 3: Performance comparisons on diversity.
Researcher Affiliation | Collaboration | Fuhai Chen1, Rongrong Ji1,2, Jiayi Ji1, Xiaoshuai Sun1, Baochang Zhang3, Xuri Ge1, Yongjian Wu4, Feiyue Huang4, Yan Wang5. Affiliations: 1Department of Artificial Intelligence, School of Informatics, Xiamen University; 2Peng Cheng Lab; 3Beihang University; 4Tencent Youtu Lab; 5Pinterest
Pseudocode | Yes | D_KL can be approximated following [28] (see algorithm flow in supplementary material). (A sketch of the standard closed-form Gaussian KL term appears after this table.)
Open Source Code | Yes | The sizes of the entity, relation, and POS vocabularies are 840, 248, and 4, respectively. [Footnote 8: https://github.com/cfh3c/NeurIPS19_VPtree_Dics]
Open Datasets | Yes | We conduct all the experiments on the MSCOCO dataset [30], which is widely used for image captioning [1, 3] and diverse image captioning [5, 8]. [Footnote 5: http://cocodataset.org/#download]
Dataset Splits | Yes | There are over 93K images in MSCOCO, which has been split into training, validation, and testing sets. [Footnote 6: https://github.com/karpathy/neuraltalk] ... We implement our model training based on the public code (footnote 9) with the standard data split and the separate z samples. (A sketch of loading the standard Karpathy split appears after this table.)
Hardware Specification | Yes | The overall process takes 37 hours on an NVIDIA GeForce GTX 1080 Ti GPU with 11 GB memory.
Software Dependencies | No | The paper mentions using the Stanford Parser [32], NLTK [33], and Python, but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | In the proposed Var MI-tree, we set the feature dimension of each node as 512. The dimensions of each mean, each standard deviation, and each latent variable are set as 150. ... We set the word vector dimension as 256 during word embedding. ... All networks are trained with SGD with a learning rate of 0.005 for the first 5 epochs; the learning rate is then halved every 5 epochs. On average, all models converge within 50 epochs. (A sketch of this learning-rate schedule appears after this table.)
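
For the Pseudocode entry: the paper only states that D_KL can be approximated following [28]. If [28] refers to the usual variational-autoencoder formulation (a closed-form KL divergence between a diagonal-Gaussian posterior and a standard-normal prior), the term looks like the sketch below. This is an assumption for illustration, not the authors' released implementation; the 150-dimensional latent size is taken from the Experiment Setup entry.

```python
import torch

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    This is the standard VAE regularizer; whether it matches the
    approximation of [28] exactly is an assumption.
    """
    # 0.5 * sum_d (mu_d^2 + sigma_d^2 - log sigma_d^2 - 1), averaged over the batch
    per_sample = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1.0, dim=-1)
    return per_sample.mean()

# Example with the paper's reported latent dimensionality of 150.
mu, log_var = torch.zeros(8, 150), torch.zeros(8, 150)
print(gaussian_kl(mu, log_var))  # 0.0 when the posterior equals the prior
```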
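
For the Dataset Splits entry: the standard split referenced via the neuraltalk footnote is conventionally distributed as a dataset_coco.json file whose image records carry a split field (train / val / test / restval). The loader below is a sketch under that assumption; the file name and field names are the community convention, not something the paper specifies.

```python
import json
from collections import defaultdict

def load_karpathy_splits(path="dataset_coco.json"):
    """Group MSCOCO images by the 'split' field of the Karpathy split file.

    Returns a dict mapping split name to a list of (filename, captions) pairs.
    """
    with open(path) as f:
        data = json.load(f)
    splits = defaultdict(list)
    for img in data["images"]:
        captions = [s["raw"] for s in img["sentences"]]
        splits[img["split"]].append((img["filename"], captions))
    return splits
```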
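
For the Experiment Setup entry: the reported schedule (SGD, initial learning rate 0.005, halved every 5 epochs, convergence within 50 epochs) maps directly onto a step decay. A minimal PyTorch sketch, with a placeholder model standing in for the actual captioning network:

```python
import torch

model = torch.nn.Linear(512, 256)  # placeholder; not the paper's captioning model
optimizer = torch.optim.SGD(model.parameters(), lr=0.005)
# Halve the learning rate every 5 epochs, matching the reported schedule.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

for epoch in range(50):  # models reportedly converge within 50 epochs
    # ... one training pass over the data would go here ...
    optimizer.step()   # placeholder; real code steps once per batch
    scheduler.step()   # decay once per epoch
```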