Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations
Authors: Fenglin Liu, Yuanxin Liu, Xuancheng Ren, Xiaodong He, Xu Sun
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed approach on two representative vision-and-language grounding tasks, i.e., image captioning and visual question answering. In both tasks, the semantic-grounded image representations consistently boost the performance of the baseline models under all metrics across the board. The results demonstrate that our approach is effective and generalizes well to a wide range of models for image-related applications. |
| Researcher Affiliation | Collaboration | Fenglin Liu¹, Yuanxin Liu³,⁴, Xuancheng Ren², Xiaodong He⁵, Xu Sun². ¹ADSPLAB, School of ECE, Peking University, Shenzhen, China; ²MOE Key Laboratory of Computational Linguistics, School of EECS, Peking University; ³Institute of Information Engineering, Chinese Academy of Sciences; ⁴School of Cyber Security, University of Chinese Academy of Sciences; ⁵JD AI Research. Emails: {fenglinliu98, renxc, xusun}@pku.edu.cn, liuyuanxin@iie.ac.cn, xiaodong.he@jd.com |
| Pseudocode | No | The paper describes algorithms and formulations in text and equations but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/fenglinliu98/MIA |
| Open Datasets | Yes | We conduct experiments on the MSCOCO image captioning dataset [7] and use SPICE [1], CIDEr [29], BLEU [22], METEOR [5] and ROUGE [14] as evaluation metrics... We experiment on the VQA v2.0 dataset [9], which is comprised of image-based question-answer pairs labeled by human annotators. (A hedged sketch of computing these captioning metrics follows the table.) |
| Dataset Splits | Yes | We evaluate with iteration numbers ranging from 1 to 5. As a general trend, the scores first rise and then decline as N increases. In every case, performance peaks at the second iteration, so we set N = 2. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, memory). |
| Software Dependencies | No | The paper mentions models and architectures like ResNet-152, Faster R-CNN, GRU, LSTM, and Transformer, but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | Particularly, we use 8 heads (k = 8) and iterate twice (N = 2), according to the performance on the validation set. For detailed settings, please refer to the supplementary material. (An illustrative sketch of this attention configuration follows the table, after the metrics example.) |
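
As context for the Open Datasets row above: the captioning metrics listed there are conventionally computed with the `pycocoevalcap` toolkit that accompanies MSCOCO. The snippet below is a minimal sketch, not the authors' evaluation script; the image ID and captions are hypothetical placeholders, and METEOR (like SPICE, omitted here) additionally requires a Java runtime.

```python
# Minimal sketch: scoring generated captions with the standard MSCOCO
# evaluation toolkit (pip install pycocoevalcap). The ID and captions
# below are hypothetical placeholders, not data from the paper.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor  # requires a Java runtime
from pycocoevalcap.rouge.rouge import Rouge

# Both dicts map an image id to a list of whitespace-tokenized captions.
references = {"391895": ["a man riding a bike down a dirt road",
                         "a person on a bicycle on a rural path"]}
candidates = {"391895": ["a man rides a bicycle on a dirt road"]}

for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                     ("ROUGE_L", Rouge()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(references, candidates)
    print(name, score)  # BLEU yields a list of BLEU-1..BLEU-4 scores
```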
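
For the Experiment Setup row, the quoted configuration (k = 8 attention heads, N = 2 iterations) can be pictured as a mutual cross-attention loop between visual regions and textual concepts. The sketch below is an illustration built on PyTorch's `nn.MultiheadAttention`, assuming a 512-dimensional feature space and hypothetical sequence lengths; the authors' actual MIA module lives in the linked repository and may differ in detail.

```python
# Rough sketch of an iterated mutual (cross-modal) attention loop with
# k = 8 heads and N = 2 iterations, as quoted in the setup row. This is
# an illustration, not the authors' MIA implementation
# (see https://github.com/fenglinliu98/MIA for that).
import torch
import torch.nn as nn

class MutualAttentionSketch(nn.Module):
    def __init__(self, dim=512, heads=8, iterations=2):
        super().__init__()
        self.iterations = iterations
        # One attention block per direction: regions attend to concepts
        # and concepts attend to regions.
        self.refine_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.refine_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, regions, concepts):
        # regions:  (batch, num_regions, dim)  visual region features
        # concepts: (batch, num_concepts, dim) textual concept features
        for _ in range(self.iterations):
            # Refine each modality by attending to the other, mutually.
            regions, _ = self.refine_vis(regions, concepts, concepts)
            concepts, _ = self.refine_txt(concepts, regions, regions)
        return regions, concepts

# Hypothetical shapes: 36 region features and 10 concept embeddings.
model = MutualAttentionSketch()
v = torch.randn(4, 36, 512)
t = torch.randn(4, 10, 512)
v_out, t_out = model(v, t)
print(v_out.shape, t_out.shape)  # (4, 36, 512) and (4, 10, 512)
```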