Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding

Authors: Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, Josh Tenenbaum

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We marry two powerful ideas: deep representation learning for visual recognition and language understanding, and symbolic program execution for reasoning. Our neural-symbolic visual question answering (NS-VQA) system first recovers a structural scene representation from the image and a program trace from the question. It then executes the program on the scene representation to obtain an answer. Incorporating symbolic structure as prior knowledge offers three unique advantages. First, executing programs on a symbolic space is more robust to long program traces; our model can solve complex reasoning tasks better, achieving an accuracy of 99.8% on the CLEVR dataset. (See the execution sketch after the table.)
Researcher Affiliation | Collaboration | Kexin Yi (Harvard University), Jiajun Wu (MIT CSAIL), Chuang Gan (MIT-IBM Watson AI Lab), Antonio Torralba (MIT CSAIL), Pushmeet Kohli (DeepMind), Joshua B. Tenenbaum (MIT CSAIL)
Pseudocode | No | The paper describes the functional modules and their execution flow in text and diagrams (e.g., Figure 2-III) but does not provide a formal pseudocode block or algorithm.
Open Source Code | Yes | Code of our model is available at https://github.com/kexinyi/ns-vqa
Open Datasets | Yes | We evaluate our NS-VQA on CLEVR [Johnson et al., 2017a]. The dataset includes synthetic images of 3D primitives with multiple attributes: shape, color, material, size, and 3D coordinates. (An illustrative object record is sketched after the table.)
Dataset Splits | Yes | Split A has 70K images and 700K questions for training, and both splits have 15K images and 150K questions for evaluation and testing. We use the first 9,000 images with 88,109 questions for training and the remaining 1,000 images with 9,761 questions for testing. We evaluate our model's performance on the validation set under various supervision signals for training.
Hardware Specification | No | The paper refers to software frameworks and models like 'Mask R-CNN' and 'ResNet-50 FPN' and mentions training details such as 'training for 30,000 iterations' but does not specify the exact hardware (e.g., GPU model, CPU, RAM) used for these experiments.
Software Dependencies | No | The paper states 'All our models are implemented in PyTorch' and 'Our implementation of the object proposal network (Mask R-CNN) is based on Detectron [Girshick et al., 2018]' but does not provide specific version numbers for PyTorch, Detectron, or other software libraries.
Experiment Setup | Yes | We train the model for 30,000 iterations with eight images per batch... using mean squared error as the loss function for 30,000 iterations with learning rate 0.002 and batch size 50. ...During supervised pretraining, we train with learning rate 7 × 10^-4 for 20,000 iterations. For REINFORCE, we set the learning rate to be 10^-5 and run at most 2M iterations... Batch size is fixed to be 64 for both training stages. (See the configuration sketch after the table.)
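
To make the execution step quoted in the Research Type row concrete, below is a minimal sketch of running a predicted program trace on a recovered scene representation. The scene contents, the program, and the module names (filter_color, filter_shape, count) are illustrative assumptions for this report, not the exact interface of the released ns-vqa code.

```python
# Minimal sketch of symbolic program execution in an NS-VQA-style system.
# Assumptions: the scene has already been parsed into attribute dictionaries
# and the question into a program trace; module names are hypothetical.

def filter_color(objects, color):
    """Keep only objects whose color attribute matches."""
    return [o for o in objects if o["color"] == color]

def filter_shape(objects, shape):
    """Keep only objects whose shape attribute matches."""
    return [o for o in objects if o["shape"] == shape]

def count(objects):
    """Return how many objects remain in the current set."""
    return len(objects)

MODULES = {"filter_color": filter_color, "filter_shape": filter_shape, "count": count}

# Structural scene representation (normally produced by the vision module).
scene = [
    {"shape": "cube",     "color": "red",  "material": "rubber", "size": "large"},
    {"shape": "sphere",   "color": "red",  "material": "metal",  "size": "small"},
    {"shape": "cylinder", "color": "blue", "material": "metal",  "size": "large"},
]

# Program trace (normally produced by the question parser) for
# "How many red objects are there?"
program = [("filter_color", "red"), ("count", None)]

state = scene
for op, arg in program:
    fn = MODULES[op]
    state = fn(state) if arg is None else fn(state, arg)

print(state)  # -> 2
```

Because this execution stage is deterministic given the scene and the program, errors do not accumulate as programs grow longer, which is the robustness to long program traces that the abstract claims.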
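
For the attributes listed in the Open Datasets row, a single entry of the structural scene representation could look like the record below. The field names and values are a hypothetical illustration of the attribute set quoted above (shape, color, material, size, 3D coordinates), not the exact schema used in the released code.

```python
# Hypothetical single-object record in the structural scene representation;
# field names are illustrative, values cover the CLEVR attributes quoted above.
scene_object = {
    "shape": "cube",                 # e.g., cube / sphere / cylinder
    "material": "rubber",            # rubber or metal
    "color": "red",
    "size": "large",                 # large or small
    "position": (1.2, -0.7, 0.35),   # 3D coordinates in the scene
}
```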
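
The Experiment Setup row quotes several training settings. A compact restatement as a configuration dictionary is sketched below; the grouping into stages and the key names are assumptions made for readability, and only the numbers come from the excerpt above.

```python
# Hedged restatement of the training settings quoted in the Experiment Setup row.
# Key names and stage grouping are illustrative; only the numeric values are
# taken from the excerpt.
TRAINING_CONFIG = {
    "scene_parser": {                # Mask R-CNN-based perception stage
        "iterations": 30_000,
        "images_per_batch": 8,
    },
    "attribute_extraction": {        # trained with mean squared error loss
        "iterations": 30_000,
        "learning_rate": 2e-3,
        "batch_size": 50,
    },
    "question_parser": {
        "pretrain": {"learning_rate": 7e-4, "iterations": 20_000},
        "reinforce": {"learning_rate": 1e-5, "max_iterations": 2_000_000},
        "batch_size": 64,            # fixed for both training stages
    },
}
```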