Storytelling from an Image Stream Using Scene Graphs

Authors: Ruize Wang, Zhongyu Wei, Piji Li, Qi Zhang, Xuanjing Huang

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments are conducted on the public visual storytelling dataset. Automatic and human evaluation results indicate that our method achieves state-of-the-art.
Researcher Affiliation | Collaboration | Ruize Wang (1), Zhongyu Wei (2,4), Piji Li (5), Qi Zhang (3), Xuanjing Huang (3). Affiliations: (1) Academy for Engineering and Technology, Fudan University, China; (2) School of Data Science, Fudan University, China; (3) School of Computer Science, Fudan University, China; (4) Research Institute of Intelligent and Complex Systems, Fudan University, China; (5) Tencent AI Lab, China
Pseudocode | No | The paper describes the model architecture and mathematical formulations, but does not provide structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions that 'AREL (Wang et al. 2018b) is trained and evaluated according to its publicly available code', referring to a baseline model, but provides no statement or link regarding open-source code for its own proposed method (SGVST).
Open Datasets | Yes | The VIST (Huang et al. 2016) dataset includes 10,117 Flickr albums with 210,819 images.
Dataset Splits | Yes | The samples are split into three parts: 40,098 for training, 4,988 for validation, and 5,050 for testing.
Hardware Specification | No | The paper mentions software components and training parameters but does not provide specific hardware details such as the GPU or CPU models used for running the experiments.
Software Dependencies | No | The paper mentions using 'Faster RCNN' and 'MOTIFS' as detectors and 'Adam' as an optimizer, but does not provide version numbers for these or any other software dependencies, making replication difficult (see the version-logging sketch after the table).
Experiment Setup | Yes | In the Multi-modal Graph Conv Net, we use a 5-layer GCN whose input and output dimensions are both 512; for the TCN, we set the dilation factor to 5 and the filter size to 7; for the high-level encoder, we use a bi-GRU with a hidden dimension of 512. We set the batch size to 100 for all experiments. We use Adam (Kingma and Ba 2015) to optimize our models with an initial learning rate of 0.0004. (A configuration sketch follows the table.)
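
Because the paper pins no dependency versions, anyone replicating the setup should record their own environment. Below is a minimal Python sketch for logging versions at run time; the package list (torch, torchvision, numpy) is an assumption about a typical PyTorch-based stack, not something the paper specifies.

```python
# Log the versions of likely dependencies for a replication attempt.
# The package list is an assumption about a typical PyTorch-based setup;
# the paper names Faster RCNN, MOTIFS, and Adam but pins no versions.
import importlib

for pkg in ("torch", "torchvision", "numpy"):
    try:
        mod = importlib.import_module(pkg)
        print(pkg, getattr(mod, "__version__", "unknown"))
    except ImportError:
        print(pkg, "not installed")
```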
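
The reported hyperparameters translate into the following minimal PyTorch sketch. This is an illustration, not the authors' implementation: the module names, the graph-convolution wiring, and the TCN padding are assumptions; only the numbers (5 GCN layers, dimension 512, dilation factor 5, filter size 7, bi-GRU hidden size 512, batch size 100, Adam with initial learning rate 0.0004) come from the paper.

```python
import torch
import torch.nn as nn

DIM = 512  # GCN input/output dimension reported in the paper

class GraphConvLayer(nn.Module):
    """One plain graph-convolution layer: aggregate neighbors, then project.

    The aggregation scheme here is an assumption for illustration; the
    paper's Multi-modal Graph Conv Net may differ in detail.
    """
    def __init__(self, dim=DIM):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (batch, nodes, dim); adj: (batch, nodes, nodes), row-normalized
        return torch.relu(self.proj(adj @ x))

gcn = nn.ModuleList(GraphConvLayer() for _ in range(5))  # 5-layer GCN

# TCN branch: dilation factor 5 and filter size 7 per the paper; the
# padding of 15 (half the effective receptive field, (7-1)*5+1 = 31)
# is an assumption that keeps the sequence length unchanged.
tcn = nn.Conv1d(DIM, DIM, kernel_size=7, dilation=5, padding=15)

# High-level encoder: bidirectional GRU with hidden dimension 512.
encoder = nn.GRU(DIM, 512, bidirectional=True, batch_first=True)

BATCH_SIZE = 100  # batch size used throughout the experiments
params = [p for m in (gcn, tcn, encoder) for p in m.parameters()]
optimizer = torch.optim.Adam(params, lr=4e-4)  # initial learning rate 0.0004
```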