From Pixels to Objects: Cubic Visual Attention for Visual Question Answering
Authors: Jingkuan Song, Pengpeng Zeng, Lianli Gao, Heng Tao Shen
IJCAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We assess the performance of our proposed CVA on three public image QA datasets, including COCO-QA, VQA and Visual7W. Experimental results show that our proposed method significantly outperforms the state-of-the-arts. |
| Researcher Affiliation | Academia | Center for Future Media and School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China |
| Pseudocode | No | The paper describes the model architecture and mathematical formulations, but it does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for their methodology is publicly available. |
| Open Datasets | Yes | We evaluate our proposed model on three public image QA datasets: the COCO-QA dataset [Ren et al., 2015], the VQA dataset (collected from the newly-released Microsoft Common Objects in Context (MS COCO) dataset), and the Visual7W dataset (collected by Zhu et al. [Zhu et al., 2016]). |
| Dataset Splits | Yes | For the VQA dataset, 204,721 real images (123,287 for training and validation, 81,434 for testing) are collected from the newly-released Microsoft Common Objects in Context (MS COCO) dataset. The paper also mentions the 'test-dev' split for debugging and validation purposes. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components such as 'Faster R-CNN', 'ResNet-101', 'GloVe word embedding', and the 'Adam' optimizer, but it does not specify version numbers for any of these components. |
| Experiment Setup | Yes | The paper states: 'For extracting visual object features... select top 36 (k = 36) object regions and each region is represented as 2,048 dimensional features.' 'the dimension of every hidden layer including GRU, attention models and the final joint feature embedding is set as 1,024.' 'our models are trained with Adam. The batch size is set to 256, and the epoch is set as 30. More specifically, gradient clipping technology and dropout are exploited in training.' (These settings are sketched in code below the table.) |
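
The Experiment Setup row above lists concrete hyperparameters. The following is a minimal, hedged PyTorch sketch of that training configuration, not the authors' implementation: the placeholder model body, dropout rate, gradient-clipping threshold, learning rate, vocabulary/answer sizes, and data loader are all assumptions. Only k = 36 regions, 2,048-dimensional region features, 1,024-dimensional hidden layers, the Adam optimizer, batch size 256, 30 epochs, and the use of gradient clipping and dropout come from the paper.

```python
# Sketch of the reported training configuration (not the authors' CVA code).
import torch
import torch.nn as nn

K_REGIONS, REGION_DIM, HIDDEN_DIM = 36, 2048, 1024   # from the paper
BATCH_SIZE, EPOCHS = 256, 30                          # from the paper
GRAD_CLIP = 0.25                                      # clip value is an assumption

class PlaceholderVQAModel(nn.Module):
    """Stand-in for the CVA model: fuses region features with a question vector."""
    def __init__(self, num_answers=3000, vocab_size=10000):   # sizes assumed
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)             # GloVe-sized embedding
        self.gru = nn.GRU(300, HIDDEN_DIM, batch_first=True)   # 1,024-d question encoder
        self.visual_proj = nn.Linear(REGION_DIM, HIDDEN_DIM)   # 2,048 -> 1,024
        self.dropout = nn.Dropout(0.5)                          # dropout rate assumed
        self.classifier = nn.Linear(HIDDEN_DIM, num_answers)

    def forward(self, regions, question_tokens):
        # regions: (B, 36, 2048); question_tokens: (B, T)
        _, q = self.gru(self.embed(question_tokens))            # question state (1, B, 1024)
        v = self.visual_proj(regions).mean(dim=1)               # naive pooling, not CVA attention
        joint = self.dropout(q.squeeze(0) * v)                  # simple multiplicative fusion
        return self.classifier(joint)

model = PlaceholderVQAModel()
optimizer = torch.optim.Adam(model.parameters())                # learning rate not reported

def train_one_epoch(loader, criterion=nn.CrossEntropyLoss()):
    # `loader` is a hypothetical DataLoader yielding batches of 256 examples.
    for regions, question_tokens, answers in loader:
        optimizer.zero_grad()
        loss = criterion(model(regions, question_tokens), answers)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)  # gradient clipping
        optimizer.step()
```

Run `train_one_epoch` for 30 epochs to match the reported schedule; the region features themselves would come from a Faster R-CNN detector with a ResNet-101 backbone, as noted in the Software Dependencies row.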