Variational Structured Semantic Inference for Diverse Image Captioning
Authors: Fuhai Chen, Rongrong Ji, Jiayi Ji, Xiaoshuai Sun, Baochang Zhang, Xuri Ge, Yongjian Wu, Feiyue Huang, Yan Wang
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the benchmark dataset show that the proposed VSSI-cap achieves significant improvements over the state-of-the-arts. ... 4 Experiments ... Table 2: Performance comparisons on accuracy of diverse image captioning. ... Table 3: Performance comparisons on diversity. |
| Researcher Affiliation | Collaboration | Fuhai Chen1, Rongrong Ji1,2, Jiayi Ji1, Xiaoshuai Sun1, Baochang Zhang3, Xuri Ge1, Yongjian Wu4, Feiyue Huang4, Yan Wang5 1Department of Artificial Intelligence, School of Informatics, Xiamen University, 2Peng Cheng Lab, 3Beihang University, 4Tencent Youtu Lab, 5Pinterest |
| Pseudocode | Yes | DKL can be approximated following [28] (see algorithm flow in supplementary material). |
| Open Source Code | Yes | The sizes of the entity's, relation's, and POS's vocabularies are 840, 248, and 4, respectively. [Footnote 8: https://github.com/cfh3c/NeurIPS19_VPtree_Dics] |
| Open Datasets | Yes | We conduct all the experiments on the MSCOCO dataset5 [30], which is widely used for image captioning [1, 3] and diverse image captioning [5, 8]. [Footnote 5: http://cocodataset.org/#download] |
| Dataset Splits | Yes | There are over 93K images in MSCOCO, which has been split into training, testing and validating sets. [Footnote 6: https://github.com/karpathy/neuraltalk] ... We implement our model training based on the public code [Footnote 9] with the standard data split and the separate z samples. |
| Hardware Specification | Yes | The overall process takes 37 hours on a NVIDIA GeForce GTX 1080 Ti GPU with 11GB memory. |
| Software Dependencies | No | The paper mentions using the Stanford Parser [32], NLTK [33], and Python, but does not provide version numbers for these software dependencies. |
| Experiment Setup | Yes | In the proposed Var MI-tree, we set the feature dimension of each node as 512. The dimensions of each mean, each standard deviation, and each latent variable are set as 150. ... We set the word vector dimension as 256 during word embedding. ... All networks are trained with SGD with a learning rate of 0.005 for the first 5 epochs, which is reduced by half every 5 epochs. On average, all models converge within 50 epochs. |
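The Pseudocode row notes that DKL can be approximated following [28], which for diagonal-Gaussian posteriors has the well-known closed form from Kingma and Welling's variational auto-encoder. A minimal sketch of that KL term, under the assumption that the paper uses a standard-normal prior (the function name and the 150-dimensional latent from the Experiment Setup row are illustrative):

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL(q || p) between a diagonal Gaussian
    q = N(mu, diag(exp(log_var))) and a standard-normal prior p = N(0, I):
    -0.5 * sum(1 + log_var - mu^2 - exp(log_var))."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# When the posterior equals the prior, the KL term vanishes.
mu = np.zeros(150)       # latent dimension 150, per the Experiment Setup row
log_var = np.zeros(150)  # log-variance 0 -> unit variance
print(gaussian_kl(mu, log_var))
```

This analytic form is what makes the KL term cheap to optimize; Monte Carlo approximation is only needed when the prior or posterior is non-Gaussian.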
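The quoted learning-rate schedule (0.005 for the first 5 epochs, halved every 5 epochs thereafter) can be written as a simple step schedule. A sketch assuming the halving happens exactly at each 5-epoch boundary, which the paper's wording implies but does not state precisely:

```python
def learning_rate(epoch, base_lr=0.005, step=5):
    """Step schedule from the Experiment Setup row: base_lr for the first
    `step` epochs, then halved after every further `step` epochs."""
    return base_lr * 0.5 ** (epoch // step)

# Epochs 0-4 use 0.005, epochs 5-9 use 0.0025, and so on.
for epoch in (0, 5, 10, 49):
    print(epoch, learning_rate(epoch))
```

By epoch 49 (the paper reports convergence within 50 epochs) the rate has been halved nine times, so the final learning rate is roughly 1e-5.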