Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding

Authors: Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, Josh Tenenbaum

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We marry two powerful ideas: deep representation learning for visual recognition and language understanding, and symbolic program execution for reasoning. Our neural-symbolic visual question answering (NS-VQA) system first recovers a structural scene representation from the image and a program trace from the question. It then executes the program on the scene representation to obtain an answer. Incorporating symbolic structure as prior knowledge offers three unique advantages. First, executing programs on a symbolic space is more robust to long program traces; our model can solve complex reasoning tasks better, achieving an accuracy of 99.8% on the CLEVR dataset. (See the execution sketch after the table.)
Researcher Affiliation | Collaboration | Kexin Yi (Harvard University), Jiajun Wu (MIT CSAIL), Chuang Gan (MIT-IBM Watson AI Lab), Antonio Torralba (MIT CSAIL), Pushmeet Kohli (DeepMind), Joshua B. Tenenbaum (MIT CSAIL)
Pseudocode | No | The paper describes the functional modules and their execution flow in text and diagrams (e.g., Figure 2-III) but does not provide a formal pseudocode block or algorithm.
Open Source Code | Yes | Code of our model is available at https://github.com/kexinyi/ns-vqa
Open Datasets | Yes | We evaluate our NS-VQA on CLEVR [Johnson et al., 2017a]. The dataset includes synthetic images of 3D primitives with multiple attributes: shape, color, material, size, and 3D coordinates. (An illustrative object record is sketched after the table.)
Dataset Splits | Yes | Split A has 70K images and 700K questions for training, and both splits have 15K images and 150K questions for evaluation and testing. We use the first 9,000 images with 88,109 questions for training and the remaining 1,000 images with 9,761 questions for testing. We evaluate our model's performance on the validation set under various supervision signals for training.
Hardware Specification | No | The paper refers to software frameworks and models like 'Mask R-CNN' and 'ResNet-50 FPN' and mentions training details such as 'training for 30,000 iterations' but does not specify the exact hardware (e.g., GPU model, CPU, RAM) used for these experiments.
Software Dependencies | No | The paper states 'All our models are implemented in PyTorch' and 'Our implementation of the object proposal network (Mask R-CNN) is based on Detectron [Girshick et al., 2018]' but does not provide specific version numbers for PyTorch, Detectron, or other software libraries.
Experiment Setup | Yes | We train the model for 30,000 iterations with eight images per batch... using mean squared error as the loss function for 30,000 iterations with learning rate 0.002 and batch size 50. ...During supervised pretraining, we train with learning rate 7 × 10^-4 for 20,000 iterations. For REINFORCE, we set the learning rate to be 10^-5 and run at most 2M iterations... Batch size is fixed to be 64 for both training stages. (See the configuration sketch after the table.)
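
To make the execution step quoted in the Research Type row concrete, below is a minimal sketch of running a predicted program trace on a recovered scene representation. The scene contents, the program, and the module names (filter_color, filter_shape, count) are illustrative assumptions for this report, not the exact interface of the released ns-vqa code.

```python
# Minimal sketch of symbolic program execution in an NS-VQA-style system.
# Assumptions: the scene has already been parsed into attribute dictionaries
# and the question into a program trace; module names are hypothetical.

def filter_color(objects, color):
    """Keep only objects whose color attribute matches."""
    return [o for o in objects if o["color"] == color]

def filter_shape(objects, shape):
    """Keep only objects whose shape attribute matches."""
    return [o for o in objects if o["shape"] == shape]

def count(objects):
    """Return how many objects remain in the current set."""
    return len(objects)

MODULES = {"filter_color": filter_color, "filter_shape": filter_shape, "count": count}

# Structural scene representation (normally produced by the vision module).
scene = [
    {"shape": "cube",     "color": "red",  "material": "rubber", "size": "large"},
    {"shape": "sphere",   "color": "red",  "material": "metal",  "size": "small"},
    {"shape": "cylinder", "color": "blue", "material": "metal",  "size": "large"},
]

# Program trace (normally produced by the question parser) for
# "How many red objects are there?"
program = [("filter_color", "red"), ("count", None)]

state = scene
for op, arg in program:
    fn = MODULES[op]
    state = fn(state) if arg is None else fn(state, arg)

print(state)  # -> 2
```

Because this execution stage is deterministic given the scene and the program, errors do not accumulate as programs grow longer, which is the robustness to long program traces that the abstract claims.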
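
For the attributes listed in the Open Datasets row, a single entry of the structural scene representation could look like the record below. The field names and values are a hypothetical illustration of the attribute set quoted above (shape, color, material, size, 3D coordinates), not the exact schema used in the released code.

```python
# Hypothetical single-object record in the structural scene representation;
# field names are illustrative, values cover the CLEVR attributes quoted above.
scene_object = {
    "shape": "cube",                 # e.g., cube / sphere / cylinder
    "material": "rubber",            # rubber or metal
    "color": "red",
    "size": "large",                 # large or small
    "position": (1.2, -0.7, 0.35),   # 3D coordinates in the scene
}
```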
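
The Experiment Setup row quotes several training settings. A compact restatement as a configuration dictionary is sketched below; the grouping into stages and the key names are assumptions made for readability, and only the numbers come from the excerpt above.

```python
# Hedged restatement of the training settings quoted in the Experiment Setup row.
# Key names and stage grouping are illustrative; only the numeric values are
# taken from the excerpt.
TRAINING_CONFIG = {
    "scene_parser": {                # Mask R-CNN-based perception stage
        "iterations": 30_000,
        "images_per_batch": 8,
    },
    "attribute_extraction": {        # trained with mean squared error loss
        "iterations": 30_000,
        "learning_rate": 2e-3,
        "batch_size": 50,
    },
    "question_parser": {
        "pretrain": {"learning_rate": 7e-4, "iterations": 20_000},
        "reinforce": {"learning_rate": 1e-5, "max_iterations": 2_000_000},
        "batch_size": 64,            # fixed for both training stages
    },
}
```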