Multimodal Residual Learning for Visual QA
Authors: Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, Byoung-Tak Zhang
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We achieve the state-of-the-art results on the Visual QA dataset for both Open-Ended and Multiple-Choice tasks. |
| Researcher Affiliation | Collaboration | Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo (Seoul National University, {jhkim,slee,dhkwak,moheo}@bi.snu.ac.kr); Jeonghee Kim, Jung-Woo Ha (Naver Labs, Naver Corp., {jeonghee.kim,jungwoo.ha}@navercorp.com); Byoung-Tak Zhang (Seoul National University & Surromind Robotics, btzhang@bi.snu.ac.kr) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the described methodology (Multimodal Residual Networks) via a specific repository link, explicit code release statement, or code in supplementary materials. |
| Open Datasets | Yes | We choose the Visual QA (VQA) dataset [1] for the evaluation of our models. [1] Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, and Devi Parikh. VQA: Visual Question Answering. In International Conference on Computer Vision, 2015. |
| Dataset Splits | Yes | The images come from the MS-COCO dataset, 123,287 of them for training and validation, and 81,434 for test. All validation is performed on the test-dev split. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'Torch framework and rnn package [13]' and 'Python Natural Language Toolkit (nltk) [3]' but does not provide specific version numbers for these or other key software components. |
| Experiment Setup | Yes | The common embedding size of the joint representation is 1,200. The learnable parameters are initialized using a uniform distribution from -0.08 to 0.08 except for the pretrained models. The batch size is 200, and the number of iterations is fixed to 250k. The RMSProp [26] is used for optimization, and dropouts [7, 5] are used for regularization. |
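
The Experiment Setup row quotes the paper's stated hyperparameters: a 1,200-dimensional joint embedding, uniform initialization in [-0.08, 0.08], batch size 200, 250k iterations, RMSProp optimization, and dropout regularization. Since no source code is released and the paper's implementation uses the Torch (Lua) framework, the following is only a minimal PyTorch-style sketch of a training configuration consistent with those quoted values; the `JointEmbeddingStub` module, its input dimensions, and the learning rate are illustrative assumptions, not the authors' Multimodal Residual Network.

```python
# Minimal sketch (assumptions: PyTorch instead of the paper's Torch/Lua code,
# a placeholder joint-embedding module, and an illustrative learning rate).
# Only the values quoted in the table above come from the paper:
# joint embedding size 1,200, uniform init in [-0.08, 0.08], batch size 200,
# 250k iterations, RMSProp optimization, dropout regularization.
import torch
import torch.nn as nn

JOINT_DIM = 1200          # common embedding size of the joint representation (paper)
BATCH_SIZE = 200          # paper
NUM_ITERATIONS = 250_000  # paper

class JointEmbeddingStub(nn.Module):
    """Placeholder joint embedding; not the paper's stacked residual blocks."""
    def __init__(self, q_dim=2400, v_dim=2048, num_answers=1000, p_drop=0.5):
        super().__init__()
        self.q_proj = nn.Sequential(nn.Dropout(p_drop), nn.Linear(q_dim, JOINT_DIM), nn.Tanh())
        self.v_proj = nn.Sequential(nn.Dropout(p_drop), nn.Linear(v_dim, JOINT_DIM), nn.Tanh())
        self.classifier = nn.Sequential(nn.Dropout(p_drop), nn.Linear(JOINT_DIM, num_answers))

    def forward(self, q, v):
        # Element-wise product of question and visual projections; the paper
        # stacks residual learning blocks around this kind of joint mapping.
        return self.classifier(self.q_proj(q) * self.v_proj(v))

def init_uniform(module):
    """Initialize learnable parameters from U(-0.08, 0.08), as quoted from the paper."""
    for p in module.parameters():
        nn.init.uniform_(p, -0.08, 0.08)

model = JointEmbeddingStub()
init_uniform(model)  # pretrained components would be excluded, per the paper
optimizer = torch.optim.RMSprop(model.parameters(), lr=3e-4)  # lr is an assumption
criterion = nn.CrossEntropyLoss()
```

The training loop itself (250k iterations at batch size 200 over the VQA training data) is omitted; it would follow the standard pattern of sampling a batch, computing the cross-entropy loss over candidate answers, and stepping the RMSProp optimizer.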