Visual Question Answering with Question Representation Update (QRU)

Authors: Ruiyu Li, Jiaya Jia

NeurIPS 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our method is evaluated on challenging datasets of COCO-QA [19] and VQA [2] and yields state-of-the-art performance."
Researcher Affiliation | Academia | "Ruiyu Li, Jiaya Jia, The Chinese University of Hong Kong, {ryli,leojia}@cse.cuhk.edu.hk"
Pseudocode | No | The paper describes the model architecture and mathematical formulations (Equations 1-7), but it contains no dedicated pseudocode block, algorithm box, or clearly formatted algorithmic steps.
Open Source Code | No | The paper neither states that source code for the described method is released nor links to a code repository.
Open Datasets | Yes | "We conduct experiments on COCO-QA [19] and VQA [2]." The COCO-QA dataset is based on Microsoft COCO image data [13], with 78,736 training questions and 38,948 test questions over a total of 123,287 images. In the VQA dataset, each COCO image is annotated by Amazon Mechanical Turk (AMT) workers with three questions, giving 248,349, 121,512 and 244,302 questions for training, validation and testing, respectively.
Dataset Splits | Yes | COCO-QA provides 78,736 training questions and 38,948 test questions; VQA provides 248,349, 121,512 and 244,302 questions for training, validation and testing, respectively (collected into a dict in the first sketch after the table).
Hardware Specification | Yes | "We thank NVIDIA for providing Ruiyu Li a Tesla K40 GPU accelerator for this work."
Software Dependencies | No | "We implement our network using the public Torch computing framework." Torch is named, but no version number or other software dependencies with versions are given.
Experiment Setup | Yes | "The network is trained in an end-to-end fashion using stochastic gradient descent with mini-batches of 100 samples and momentum 0.9. The learning rate starts from 10^-3 and decreases by a factor of 10 when validation accuracy stops improving. We use dropout and gradient clipping to regularize the training process." For COCO-QA the dimension of the common latent space is set to 1,024; since VQA is larger than COCO-QA, the dimension is doubled to 2,048 to adapt to the data and classes (see the training sketch after the table).
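For reference, the split sizes quoted in the Open Datasets and Dataset Splits rows, collected into a small machine-readable table (the variable name and layout are my own, not the paper's):

```python
# Question counts per split, as quoted from the paper.
SPLITS = {
    "COCO-QA": {"train": 78_736, "test": 38_948},  # over 123,287 images in total
    "VQA": {"train": 248_349, "val": 121_512, "test": 244_302},
}
```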
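The Experiment Setup row quotes enough hyperparameters to reconstruct the optimization recipe. The authors implemented the network in the (Lua) Torch framework; the following is a minimal PyTorch sketch of that recipe, assuming a placeholder two-layer model, a 4,800-D question input, a 0.5 dropout rate, a 430-way answer classifier, and a clipping threshold of 10, none of which the quoted text specifies.

```python
import torch
import torch.nn as nn

# Placeholder stand-in for the QRU network; the paper's actual
# architecture (iterative question-representation updates against
# image-region features) is not reproduced here.
model = nn.Sequential(
    nn.Linear(4800, 1024),  # 1,024-D common latent space (COCO-QA); 2,048 for VQA
    nn.Tanh(),
    nn.Dropout(p=0.5),      # "dropout" is stated in the paper; the rate is an assumption
    nn.Linear(1024, 430),   # answer-classifier width is an assumption
)
criterion = nn.CrossEntropyLoss()

# SGD with mini-batches of 100 samples and momentum 0.9, as quoted.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# Learning rate starts at 10^-3 and drops by a factor of 10 when
# validation accuracy stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1)

def train_one_epoch(loader, clip_norm=10.0):  # clipping threshold assumed
    model.train()
    for questions, answers in loader:  # batches of 100, set in the DataLoader
        optimizer.zero_grad()
        loss = criterion(model(questions), answers)
        loss.backward()
        # Gradient clipping, as mentioned in the paper's training setup.
        nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        optimizer.step()

# After each epoch, step the scheduler on validation accuracy:
#   scheduler.step(val_accuracy)
```

ReduceLROnPlateau here stands in for the paper's manually stated rule of dropping the learning rate when validation accuracy stops improving; everything outside the quoted numbers is illustrative only.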