Knowledgeable Storyteller: A Commonsense-Driven Generative Model for Visual Storytelling

Authors: Pengcheng Yang, Fuli Luo, Peng Chen, Lei Li, Zhiyi Yin, Xiaodong He, Xu Sun

IJCAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on the VIST dataset [Huang et al., 2016], which consists of 10,117 Flickr albums and 210,819 unique photos. We follow the standard split [Wang et al., 2018b] for a fair comparison. Automatic evaluation: the automatic evaluation of visual storytelling remains an open and difficult question, since the task is highly flexible and stories are subjective; we therefore adopt a combination of evaluation metrics, including BLEU, ROUGE, METEOR, and CIDEr (a hedged metric-computation sketch follows the table). Human evaluation: we also conduct human evaluation to more accurately assess the quality of the output.
Researcher Affiliation | Collaboration | Pengcheng Yang (1,2), Fuli Luo (2), Peng Chen (2), Lei Li (2), Zhiyi Yin (2), Xiaodong He (3), Xu Sun (1,2). 1: Deep Learning Lab, Beijing Institute of Big Data Research, Peking University; 2: MOE Key Lab of Computational Linguistics, School of EECS, Peking University; 3: JD AI Research, China
Pseudocode | No | The paper describes the model architecture and processes mathematically and textually but does not include any pseudocode blocks or algorithms labeled as such.
Open Source Code | Yes | The code is available at https://github.com/lancopku/CVST
Open Datasets | Yes | We conduct experiments on the VIST dataset [Huang et al., 2016], which consists of 10,117 Flickr albums and 210,819 unique photos.
Dataset Splits | Yes | We follow the standard split [Wang et al., 2018b] for a fair comparison.
Hardware Specification | No | The paper mentions using a ResNet-152 model for visual features and GRU models, but it does not specify the hardware (e.g., GPU model, CPU, memory) used for training or inference.
Software Dependencies | No | The paper mentions using ResNet-152 and the Adam optimizer, but it does not provide specific version numbers for software components such as Python, PyTorch/TensorFlow, or CUDA libraries.
Experiment Setup | Yes | We set the batch size to 64 and the vocabulary size to 30,000. The 512-dim word embeddings are learned from scratch. We apply ResNet-152 [He et al., 2016], pre-trained on ImageNet, to extract visual features. All GRU models have two layers with hidden size 512; all are bidirectional except the decoder, which is unidirectional. The parameter λ is set to 0.05. We use the Adam optimizer [Kingma and Ba, 2014] with an initial learning rate of 10^-3. (A hedged configuration sketch follows the table.)
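
For reference, the sketch below shows one common way to compute the reported automatic metrics (BLEU, METEOR, ROUGE, CIDEr) for generated stories. The paper does not name its evaluation toolkit, so the use of the pycocoevalcap package, the album IDs, and the example stories here are assumptions, not the authors' evaluation code.

    # Hedged sketch: the paper does not name its evaluation toolkit; this assumes
    # the commonly used pycocoevalcap package and made-up album IDs/stories.
    from pycocoevalcap.bleu.bleu import Bleu
    from pycocoevalcap.meteor.meteor import Meteor
    from pycocoevalcap.rouge.rouge import Rouge
    from pycocoevalcap.cider.cider import Cider

    def score_stories(references, hypotheses):
        # references/hypotheses: dict mapping album id -> list of story strings
        # (hypotheses holds exactly one generated story per album).
        scorers = [
            (Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
            (Meteor(), "METEOR"),
            (Rouge(), "ROUGE-L"),
            (Cider(), "CIDEr"),
        ]
        results = {}
        for scorer, name in scorers:
            score, _ = scorer.compute_score(references, hypotheses)
            if isinstance(name, list):  # BLEU returns one score per n-gram order
                results.update(dict(zip(name, score)))
            else:
                results[name] = score
        return results

    # Illustrative usage with made-up data.
    refs = {"album_0": ["the family gathered for the graduation ceremony ."]}
    hyps = {"album_0": ["everyone came to celebrate the graduation ."]}
    print(score_stories(refs, hyps))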
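
The reported hyperparameters translate into roughly the following PyTorch setup. This is a minimal sketch assuming standard torch/torchvision components; the module names, the visual projection layer, and the overall wiring are illustrative assumptions, not the authors' released implementation (see the GitHub repository above), and λ is included only as the value quoted in the paper.

    # Hedged sketch of the reported hyperparameters; module names and wiring are
    # assumptions, not the authors' released code.
    import torch
    import torch.nn as nn
    import torchvision.models as models

    VOCAB_SIZE = 30000   # vocabulary size
    EMB_DIM = 512        # word embeddings learned from scratch
    HIDDEN = 512         # GRU hidden size
    NUM_LAYERS = 2       # all GRU models have two layers
    LAMBDA = 0.05        # the parameter lambda reported in the paper
    LR = 1e-3            # initial learning rate for Adam
    BATCH_SIZE = 64

    # ResNet-152 pre-trained on ImageNet, with the classification head removed,
    # used as the visual feature extractor (2048-dim features).
    resnet = models.resnet152(pretrained=True)
    feature_extractor = nn.Sequential(*list(resnet.children())[:-1])
    visual_proj = nn.Linear(2048, EMB_DIM)  # projection into the model dimension (assumption)

    embedding = nn.Embedding(VOCAB_SIZE, EMB_DIM)

    # Bidirectional two-layer GRU encoder; the decoder is the only unidirectional GRU.
    encoder = nn.GRU(EMB_DIM, HIDDEN, num_layers=NUM_LAYERS,
                     bidirectional=True, batch_first=True)
    decoder = nn.GRU(EMB_DIM, HIDDEN, num_layers=NUM_LAYERS,
                     bidirectional=False, batch_first=True)
    output_proj = nn.Linear(HIDDEN, VOCAB_SIZE)

    params = (list(embedding.parameters()) + list(visual_proj.parameters())
              + list(encoder.parameters()) + list(decoder.parameters())
              + list(output_proj.parameters()))
    optimizer = torch.optim.Adam(params, lr=LR)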