Hierarchical Photo-Scene Encoder for Album Storytelling

Authors: Bairui Wang, Lin Ma, Wei Zhang, Wenhao Jiang, Feng Zhang

AAAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments: In this section, we evaluate the effectiveness of our proposed model on album storytelling. We first describe the datasets used for evaluation, followed by a brief description of competitor models. Afterward, the experimental results on album storytelling are illustrated and discussed.
Researcher Affiliation | Collaboration | Bairui Wang (1), Lin Ma (2), Wei Zhang (1), Wenhao Jiang (2), Feng Zhang (2); (1) School of Control Science and Engineering, Shandong University; (2) Tencent AI Lab; {bairuiwong, forest.linma, cswhjiang}@gmail.com, davidzhang@sdu.edu.cn, jayzhang@tencent.com
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper.
Open Datasets | Yes | To compare with existing methods, we evaluate the proposed album storytelling model on the Visual Storytelling dataset (VIST) (Huang et al. 2016), which was created specifically for the task of album storytelling.
Dataset Splits | Yes | The VIST dataset is split into three parts: 8,031 albums for training, 998 for validation, and 1,011 for testing.
Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments.
Software Dependencies | No | The paper mentions software components such as ResNet and the Adam optimizer, but does not provide specific version numbers for any libraries or frameworks used (e.g., PyTorch or TensorFlow versions).
Experiment Setup | Yes | In this section, we describe the detailed configurations and implementation details of the whole proposed network, including the hierarchical photo-scene encoder, the decoder, and the reconstructor. For the sentences, words that occur fewer than 5 times are eliminated. Each sentence within each story is truncated to 25 words, and each word is embedded as a 512-dimensional vector. For each album, following (Yu, Bansal, and Berg 2017), we truncate the photo stream to 40 photos rather than using only the 5 labeled photos per album. For each photo, we use ResNet101 pre-trained on the ILSVRC2012-CLS dataset (Russakovsky et al. 2015) as the feature extractor to generate a 2048-dimensional feature. The sizes of all GRUs in the hierarchical model and of the linear functions in both the photo and scene encoders are set to 512. For the decoder, since the number of scene representations is dynamic, the dimension of the weight vector is determined by the total number of photo and scene features. The hidden states of the GRUs are initialized to zero, except that the attention GRU is initialized with the final state of the photo encoder. We use Adam (Kingma and Ba 2014) as the optimizer, with the initial learning rate set to 0.0004 and the other hyperparameters at their recommended values. Training terminates when the CIDEr metric on the validation set stops improving for 30 consecutive validations. Training of the whole network proceeds in two stages: first, the encoder-decoder is trained until convergence; afterwards, the reconstructor is stacked on top and trained jointly with the loss function defined in Eq. (11).
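The quoted setup is enough to reconstruct the hyperparameters and the training schedule in outline. The sketch below is a minimal plain-Python rendering of that configuration, the CIDEr-based early stopping, and the two-stage schedule. The class name `AlbumStoryConfig`, the helper names, and the choice of applying the same stopping rule to both stages are our assumptions; the authors do not release code.

```python
from dataclasses import dataclass


@dataclass
class AlbumStoryConfig:
    # Values taken from the experiment setup quoted above; the field names
    # are our own, since the paper does not release an implementation.
    min_word_count: int = 5        # drop words occurring fewer than 5 times
    max_sentence_len: int = 25     # truncate each sentence to 25 words
    word_embed_dim: int = 512      # word embedding size
    max_album_photos: int = 40     # truncate each photo stream to 40 photos
    photo_feature_dim: int = 2048  # ResNet101 (ILSVRC2012-CLS pre-trained) features
    hidden_dim: int = 512          # all GRUs and encoder linear layers
    learning_rate: float = 4e-4    # Adam; other hyperparameters at defaults
    patience: int = 30             # validations without CIDEr gain before stopping


def should_stop(cider_history, patience):
    """Early stopping as described: stop once the validation CIDEr has not
    improved for `patience` consecutive validations."""
    if len(cider_history) <= patience:
        return False
    best_before = max(cider_history[:-patience])
    return max(cider_history[-patience:]) <= best_before


def two_stage_training(train_enc_dec_step, train_joint_step, validate_cider, cfg):
    """Two-stage schedule from the paper: first train the encoder-decoder,
    then stack the reconstructor and train jointly with the loss of Eq. (11).
    The three callables are placeholders for steps the paper does not specify;
    reusing the same CIDEr stopping rule in both stages is our assumption."""
    for stage_step in (train_enc_dec_step, train_joint_step):
        history = []
        while True:
            stage_step()
            history.append(validate_cider())
            if should_stop(history, cfg.patience):
                break
```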