Exploring and Distilling Cross-Modal Information for Image Captioning

Authors: Fenglin Liu, Xuancheng Ren, Yuanxin Liu, Kai Lei, Xu Sun

IJCAI 2019

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | The experiments on the COCO image captioning dataset validate our argument and prove the effectiveness of the proposed approach. |
| Researcher Affiliation | Academia | (1) Shenzhen Key Lab for Information Centric Networking & Blockchain Technology (ICNLAB), School of Electronics and Computer Engineering (SECE), Peking University; (2) MOE Key Laboratory of Computational Linguistics, School of EECS, Peking University; (3) School of ICE, Beijing University of Posts and Telecommunications |
| Pseudocode | No | The paper describes the model architecture and equations but does not contain a structured pseudocode or algorithm block. |
| Open Source Code | No | The paper does not provide concrete access to source code for the described methodology. |
| Open Datasets | Yes | We evaluate the proposed approach on the widely used COCO dataset [Chen et al., 2015], which contains 123,287 images. |
| Dataset Splits | Yes | We use the publicly available splits of [Karpathy and Li, 2015] for offline evaluation. For COCO, there are 5,000 images each in the validation and test sets. |
| Hardware Specification | Yes | Time and speed are measured on a single NVIDIA GeForce GTX 1080 Ti. |
| Software Dependencies | No | The paper mentions general software components (e.g., Python, PyTorch, TensorFlow) but does not provide specific version numbers for these or other libraries/solvers. |
| Experiment Setup | Yes | The word embedding size and model size are 256 and 512, respectively; in implementation, the attribute embedding and the input word embedding are shared. The number of heads n in multi-head attention is set to 8 unless otherwise stated. The model is trained with both cross-entropy loss and reinforcement learning optimizing CIDEr: first with cross-entropy loss at a batch size of 80 for 25 epochs, with early stopping based on CIDEr, followed by reinforcement learning. Adam [Kingma and Ba, 2014] with a learning rate of 10^-4 is used for parameter optimization, and beam search with beam size = 3 is applied during inference. |
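The inference procedure quoted above uses beam search with beam size 3. The paper does not release code, so the following is only a minimal illustrative sketch of that decoding strategy; `step_fn`, `toy_model`, and the toy vocabulary are hypothetical stand-ins for the captioning model's next-word distribution, not the authors' implementation.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_size=3, max_len=10):
    """Minimal beam search decoder.

    step_fn(seq) -> dict mapping each candidate next token to its probability.
    At every step, only the `beam_size` partial sequences with the highest
    cumulative log-probability are kept (beam size = 3 matches the paper).
    """
    beams = [([start_token], 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:        # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            for tok, p in step_fn(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        # prune to the `beam_size` best-scoring candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == end_token for seq, _ in beams):
            break
    return beams[0][0]  # highest-scoring sequence

# Hypothetical next-token distribution standing in for the trained model.
def toy_model(seq):
    vocab = {"<s>":  {"a": 0.6, "the": 0.4},
             "a":    {"dog": 0.7, "cat": 0.3},
             "the":  {"dog": 0.2, "cat": 0.8},
             "dog":  {"</s>": 1.0},
             "cat":  {"</s>": 1.0}}
    return vocab[seq[-1]]

print(beam_search(toy_model, "<s>", "</s>"))  # → ['<s>', 'a', 'dog', '</s>']
```

In a real captioning model, `step_fn` would run the decoder over image features and return softmax scores over the full vocabulary; the pruning logic stays the same.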