Joint Commonsense and Relation Reasoning for Image and Video Captioning

Authors: Jingyi Hou, Xinxiao Wu, Xiaoxun Zhang, Yayun Qi, Yunde Jia, Jiebo Luo (pp. 10973-10980)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on several benchmark datasets validate the effectiveness of our prior knowledge-based approach. We conduct experiments on a video captioning dataset, MSVD (Guadarrama et al. 2013), and an image captioning dataset, MSCOCO (Lin et al. 2014).
Researcher Affiliation | Collaboration | Jingyi Hou (1), Xinxiao Wu (1), Xiaoxun Zhang (2), Yayun Qi (1), Yunde Jia (1), Jiebo Luo (3); (1) Lab. of IIT, School of Computer Science, Beijing Institute of Technology, Beijing 100081, China; (2) Alibaba Group; (3) Department of Computer Science, University of Rochester, Rochester NY 14627, USA
Pseudocode | Yes | Algorithm 1: C-R Reasoning. Input: visual feature vectors V = {V_n}_{n=1}^N and knowledge vectors K = {K_n}_{n=1}^N of N images or videos. Output: C-R Reasoning model. (A hedged interface sketch for this algorithm follows the table.)
Open Source Code | No | The paper does not provide any statement or link regarding the availability of its source code.
Open Datasets | Yes | We conduct experiments on a video captioning dataset, MSVD (Guadarrama et al. 2013), and an image captioning dataset, MSCOCO (Lin et al. 2014). We employ external knowledge graphs in Visual Genome (Krishna et al. 2017)...
Dataset Splits | Yes | The MSVD dataset ... We follow the split in (Venugopalan et al. 2015a) which divides the videos into three parts: 1,200 training videos, 100 validation videos and 670 testing videos. ... We follow the standard split by (Karpathy and Fei-Fei 2017) which takes 113,287 images for training, 5,000 for validation and 5,000 for testing.
Hardware Specification | No | The paper mentions deep learning model architectures like 'ResNeXt-101' and 'IRv2' used for feature extraction, but does not specify any particular hardware (e.g., GPU models, CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions the 'NLTK toolkit (Xue 2011)', 'RPN (Ren et al. 2017)', and 'jieba (https://github.com/fxsjy/jieba)'. While specific tools are mentioned, precise version numbers for major software dependencies or frameworks (e.g., Python, PyTorch, TensorFlow) are not provided.
Experiment Setup | Yes | In visual mapping, ... the number of clusters is set from 5 to 10. In knowledge mapping, the number of sparse attention operations is set to 3... In the sequence-based language model, both the number of hidden units in each LSTM and the size of the input word embedding are set to 512. ... we set λ = 0.01 and γ = 0.3, empirically. ... the sizes of beam search are set to 3 and 5... we set β to 0 during the first few epochs of training, and to 0.1 afterwards. (A hedged configuration sketch consolidating these values follows the table.)
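
The Pseudocode row above only quotes the input/output signature of Algorithm 1, so the following Python sketch illustrates what that interface could look like. It is a minimal, hypothetical skeleton: the class name, the linear mappings, the optimizer, and the placeholder loss are assumptions for illustration; only the inputs (visual features V and knowledge vectors K for N images or videos) and the output (a C-R Reasoning model) come from the quoted algorithm.

```python
# Hypothetical sketch of the Algorithm 1 interface (C-R Reasoning).
# Only the inputs (V, K) and the output (a trained model) are taken from the
# quoted pseudocode; everything else is an illustrative assumption.
import torch
import torch.nn as nn


class CRReasoningSketch(nn.Module):
    """Hypothetical stand-in for the C-R Reasoning model."""

    def __init__(self, visual_dim: int, knowledge_dim: int, hidden_dim: int = 512):
        super().__init__()
        # Placeholder for the paper's visual mapping and knowledge mapping steps:
        # project both modalities into a shared space.
        self.visual_map = nn.Linear(visual_dim, hidden_dim)
        self.knowledge_map = nn.Linear(knowledge_dim, hidden_dim)

    def forward(self, v: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        # The real model performs commonsense and relation reasoning here;
        # a simple sum of the two mapped representations stands in for it.
        return self.visual_map(v) + self.knowledge_map(k)


def train_cr_reasoning(features: torch.Tensor, knowledge: torch.Tensor,
                       epochs: int = 10) -> CRReasoningSketch:
    """Train on N samples of visual features V and knowledge vectors K."""
    model = CRReasoningSketch(features.size(-1), knowledge.size(-1))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(epochs):
        optimizer.zero_grad()
        fused = model(features, knowledge)
        # Placeholder objective; the paper optimizes a captioning loss instead.
        loss = fused.pow(2).mean()
        loss.backward()
        optimizer.step()
    return model


# Example usage: 8 samples with 2048-d visual features and 300-d knowledge vectors.
model = train_cr_reasoning(torch.randn(8, 2048), torch.randn(8, 300))
```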
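
To make the hyperparameters quoted in the Experiment Setup row easier to scan, here is a hedged consolidation as a Python dictionary. The key names and the grouping are assumptions for readability; only the numeric values come from the quoted text.

```python
# Hypothetical configuration dictionary; the values are quoted from the paper's
# experiment setup, while the key names are illustrative only.
EXPERIMENT_SETUP = {
    "visual_mapping": {"num_clusters_range": (5, 10)},   # "set from 5 to 10"
    "knowledge_mapping": {"num_sparse_attention_ops": 3},
    "language_model": {
        "lstm_hidden_units": 512,
        "word_embedding_size": 512,
    },
    "loss_weights": {"lambda": 0.01, "gamma": 0.3},      # set empirically
    "beam_search_sizes": (3, 5),
    # beta is 0 during the first few epochs of training, then 0.1.
    "beta_schedule": {"initial": 0.0, "afterwards": 0.1},
}
```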