Chain of Reasoning for Visual Question Answering

Authors: Chenfei Wu, Jinlai Liu, Xiaojie Wang, Xuan Dong

NeurIPS 2018

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We achieve new state-of-the-art results on four publicly available datasets. We conduct a detailed ablation study to show that our proposed chain structure is superior to stack structure and parallel structure. |
| Researcher Affiliation | Academia | Chenfei Wu, Jinlai Liu, Xiaojie Wang, Xuan Dong. Center for Intelligence Science and Technology, Beijing University of Posts and Telecommunications. {wuchenfei, liujinlai, xjwang, dongxuan8811}@bupt.edu.cn |
| Pseudocode | No | The paper describes the model architecture and operations using mathematical equations (Eq. 1–13) and textual explanations, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | More details, including source codes, will be published in the near future. |
| Open Datasets | Yes | We evaluate our model on four public datasets: the VQA 1.0 dataset [29], the VQA 2.0 dataset [30], the COCO-QA dataset [31] and the TDIUC dataset [27]. |
| Dataset Splits | Yes | For a fair comparison, all the data provided in this section are trained on the VQA 2.0 training set and tested on the VQA 2.0 validation set. |
| Hardware Specification | No | The paper mentions implementing the model in PyTorch but does not specify any hardware details such as GPU models, CPU types, or cloud computing environments used to run the experiments. |
| Software Dependencies | No | We implement the model using PyTorch. We use Adam [35] to train the model. While software names such as PyTorch and Adam are mentioned, no version numbers are provided for these or other dependencies. |
| Experiment Setup | Yes | During the data-embedding phase, the image features are mapped to the size of 36 × 2048 and the text features are mapped to the size of 2400. In the chain of reasoning phase, the hidden-layer size in Mutan is 510; the hyperparameter K is 5. The attention hidden unit number is 620. In the decision-making phase, the joint feature embedding is set to 510. All the nonlinear layers of the model use the ReLU activation function and dropout [34] to prevent overfitting. We use Adam [35] to train the model with a learning rate of 10⁻⁴ and a batch size of 64. (See the configuration sketch below this table.) |
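
The Experiment Setup row pins down most training hyperparameters. As a rough illustration, the sketch below wires the reported values (36 × 2048 image features, 2400-dimensional text features, a 510-dimensional joint embedding, ReLU with dropout, Adam at a 10⁻⁴ learning rate, batch size 64) into a minimal PyTorch training step. The fusion module, the answer-vocabulary size (3000), and the dropout rate (0.5) are hypothetical placeholders; this is not the paper's MUTAN-based chain of reasoning.

```python
# Minimal sketch of the reported training configuration (not the authors' code).
# Only the hyperparameters marked "as reported" come from the paper's setup;
# the fusion block, answer count, and dropout rate are illustrative assumptions.
import torch
import torch.nn as nn


class FusionBlockSketch(nn.Module):
    """Placeholder fusion block: projects image and text features to a 510-d
    joint embedding (as reported) and classifies; stands in for the actual model."""

    def __init__(self, img_dim=2048, txt_dim=2400, joint_dim=510,
                 num_answers=3000, p_drop=0.5):  # num_answers, p_drop: assumed
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)
        self.classifier = nn.Sequential(
            nn.ReLU(), nn.Dropout(p_drop), nn.Linear(joint_dim, num_answers)
        )

    def forward(self, img_feat, txt_feat):
        # img_feat: (batch, 36, 2048) region features; txt_feat: (batch, 2400) question embedding
        joint = self.img_proj(img_feat).mean(dim=1) * self.txt_proj(txt_feat)
        return self.classifier(joint)


model = FusionBlockSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam, lr = 10⁻⁴ (as reported)
criterion = nn.CrossEntropyLoss()
batch_size = 64                                            # as reported

# One illustrative training step on random tensors with the reported shapes.
img = torch.randn(batch_size, 36, 2048)
txt = torch.randn(batch_size, 2400)
labels = torch.randint(0, 3000, (batch_size,))
loss = criterion(model(img, txt), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```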