Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations

Authors: Fenglin Liu, Yuanxin Liu, Xuancheng Ren, Xiaodong He, Xu Sun

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed approach on two representative vision-and-language grounding tasks, i.e., image captioning and visual question answering. In both tasks, the semantic-grounded image representations consistently boost the performance of the baseline models under all metrics across the board. The results demonstrate that our approach is effective and generalizes well to a wide range of models for image-related applications.
Researcher Affiliation | Collaboration | Fenglin Liu (1), Yuanxin Liu (3,4), Xuancheng Ren (2), Xiaodong He (5), Xu Sun (2). Affiliations: (1) ADSPLAB, School of ECE, Peking University, Shenzhen, China; (2) MOE Key Laboratory of Computational Linguistics, School of EECS, Peking University; (3) Institute of Information Engineering, Chinese Academy of Sciences; (4) School of Cyber Security, University of Chinese Academy of Sciences; (5) JD AI Research. Contact: {fenglinliu98, renxc, xusun}@pku.edu.cn, liuyuanxin@iie.ac.cn, xiaodong.he@jd.com
Pseudocode | No | The paper describes its algorithms and formulations in prose and equations but does not include structured pseudocode or algorithm blocks.
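Since the paper stops short of pseudocode, a minimal PyTorch sketch of the mutual iterative attention it describes may help readers. Only the Transformer-style attention, k = 8 heads, and N = 2 iterations come from the paper; the module layout, residual connections, LayerNorm placement, and feature dimension are assumptions.

```python
import torch
import torch.nn as nn

class MutualIterativeAttention(nn.Module):
    """Hedged sketch: visual regions and textual concepts attend to each
    other for N rounds so the two modalities are progressively aligned."""

    def __init__(self, dim: int = 512, num_heads: int = 8, num_iters: int = 2):
        super().__init__()
        self.num_iters = num_iters
        # One attention block per direction; batch_first keeps (B, seq, dim) layout.
        self.region_to_concept = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.concept_to_region = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_regions = nn.LayerNorm(dim)
        self.norm_concepts = nn.LayerNorm(dim)

    def forward(self, regions: torch.Tensor, concepts: torch.Tensor):
        # regions:  (B, n_regions, dim), e.g. Faster R-CNN region features
        # concepts: (B, n_concepts, dim), e.g. embedded textual concepts
        for _ in range(self.num_iters):
            # Regions query the concepts, then concepts query the updated regions.
            r_upd, _ = self.region_to_concept(regions, concepts, concepts)
            regions = self.norm_regions(regions + r_upd)
            c_upd, _ = self.concept_to_region(concepts, regions, regions)
            concepts = self.norm_concepts(concepts + c_upd)
        return regions, concepts
```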
Open Source Code | Yes | The code is available at https://github.com/fenglinliu98/MIA
Open Datasets | Yes | We conduct experiments on the MSCOCO image captioning dataset [7] and use SPICE [1], CIDEr [29], BLEU [22], METEOR [5] and ROUGE [14] as evaluation metrics... We experiment on the VQA v2.0 dataset [9], which is comprised of image-based question-answer pairs labeled by human annotators.
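Both datasets are public. As a hedged illustration of access, the MSCOCO caption annotations can be read with the official pycocotools API; the annotation file path below is an assumption about the local directory layout.

```python
from pycocotools.coco import COCO

# Load the (assumed) local copy of the MSCOCO validation caption annotations.
coco = COCO("annotations/captions_val2014.json")

img_ids = coco.getImgIds()
ann_ids = coco.getAnnIds(imgIds=img_ids[0])
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])  # the human-written reference captions for this image
```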
Dataset Splits | Yes | We evaluate with iteration times ranging from 1 to 5. As a holistic trend, the scores first rise and then decline as N increases. The performances consistently reach their best at the second iteration, which is why we set N = 2.
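The quoted passage concerns model selection on the validation split rather than the split sizes themselves. A hedged sketch of that sweep is below; build_model and evaluate_cider are hypothetical helpers standing in for the authors' pipeline.

```python
# Hypothetical sweep over the number of refinement iterations N on the
# validation split, as described in the paper; helper functions are stand-ins.
best_n, best_score = None, float("-inf")
for n in range(1, 6):  # N = 1..5
    model = build_model(num_iters=n)      # hypothetical model factory
    score = evaluate_cider(model, "val")  # hypothetical validation-set metric
    if score > best_score:
        best_n, best_score = n, score
print(f"best N = {best_n} (CIDEr = {best_score:.3f})")  # the paper reports N = 2
```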
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, memory).
Software Dependencies | No | The paper mentions models and architectures like ResNet-152, Faster R-CNN, GRU, LSTM, and Transformer, but does not provide specific version numbers for software dependencies or libraries.
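Since no versions are pinned, one plausible reconstruction of the grid-feature backbone uses torchvision's pretrained ResNet-152; the weights enum, input size, and feature reshaping below are assumptions, not the authors' documented setup.

```python
import torch
import torchvision.models as models

# Hedged sketch: ResNet-152 (ImageNet weights) as a grid-feature extractor.
backbone = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
extractor = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
extractor.eval()

with torch.no_grad():
    images = torch.randn(1, 3, 224, 224)        # dummy batch; real inputs need the
                                                # standard ImageNet normalization
    grid = extractor(images)                    # (1, 2048, 7, 7) feature map
    regions = grid.flatten(2).transpose(1, 2)   # (1, 49, 2048) region-like vectors
```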
Experiment Setup | Yes | Particularly, we use 8 heads (k = 8) and iterate twice (N = 2), according to the performance on the validation set. For detailed settings, please refer to the supplementary material.
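For context, a hypothetical instantiation of the MutualIterativeAttention sketch from the Pseudocode row with these reported settings; the batch size, 36-region Faster R-CNN convention, and concept count are assumptions.

```python
import torch

mia = MutualIterativeAttention(dim=512, num_heads=8, num_iters=2)  # k = 8, N = 2
regions = torch.randn(4, 36, 512)    # assumed: 36 detected regions per image
concepts = torch.randn(4, 10, 512)   # assumed: 10 embedded textual concepts
aligned_regions, aligned_concepts = mia(regions, concepts)
print(aligned_regions.shape, aligned_concepts.shape)  # (4, 36, 512), (4, 10, 512)
```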