Interpretable Counting for Visual Question Answering

Authors: Alexander Trott, Caiming Xiong, Richard Socher

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Furthermore, our method outperforms the state of the art architecture for VQA on multiple metrics that evaluate counting.
Researcher Affiliation | Industry | Alexander Trott, Caiming Xiong, & Richard Socher, Salesforce Research, Palo Alto, CA. {atrott,cxiong,rsocher}@salesforce.com
Pseudocode | No | The paper describes the model and methods using mathematical equations and prose but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | To facilitate future comparison to our work, we have made the training, development, and test question IDs available for download.
Open Datasets | Yes | For training and evaluation, we create a new dataset, HowMany-QA. It is taken from the counting-specific union of VQA 2.0 (Goyal et al., 2017) and Visual Genome QA (Krishna et al., 2016).
Dataset Splits | Yes | The original VQA 2.0 train set includes roughly 444K QA pairs, of which 57,606 are labeled as having a number answer. Focusing on counting questions results in a still very large dataset with 47,542 pairs ... we divide the validation data into separate development and test sets. More specifically, we apply the above criteria to the official validation data and select 5,000 of the resulting QA pairs to serve as the test data. The remaining 17,714 QA pairs are used as the development set. (Table 1: Train 83,642, Dev. 17,714, Test 5,000; see the split sketch after this table.)
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU model, CPU type, memory) used for running experiments.
Software Dependencies | No | The paper mentions software components such as GloVe, LSTM, and the Adam optimizer, but does not specify their version numbers or other software dependencies with version details.
Experiment Setup | Yes | When training on counting, we optimize using Adam (Kingma & Ba, 2014). For SoftCount and UpDown, we use a learning rate of 3×10⁻⁴ and decay the learning rate by 0.8 when the training accuracy plateaus. For IRLC, we use a learning rate of 5×10⁻⁴ and decay the learning rate by 0.99999 every iteration. For all models, we regularize using dropout and apply early stopping based on the development set accuracy (see below). ... We weight the entropy penalty P_H and interaction penalty P_I (Eq. 10) both by 0.005 relative to the counting loss. (See the optimizer sketch after this table.)
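The Dataset Splits and Open Source Code rows note that the authors released the HowMany-QA training, development, and test question IDs rather than the split data itself. The sketch below is a minimal, hypothetical illustration of how such ID lists could be applied to the official VQA 2.0 validation annotations to recover the 17,714-pair development set and 5,000-pair test set; the ID file names are assumptions, since the paper only states that the IDs are available for download.

```python
# Hypothetical reconstruction of the HowMany-QA dev/test split.
# The ID file names below are assumptions; the paper releases the question
# IDs but does not describe a file layout.
import json

def load_ids(path):
    """Read one question ID per line into a set of ints."""
    with open(path) as f:
        return {int(line.strip()) for line in f if line.strip()}

dev_ids = load_ids("dev_question_ids.txt")    # expected: 17,714 IDs
test_ids = load_ids("test_question_ids.txt")  # expected: 5,000 IDs

# Official VQA 2.0 validation annotations (format as published on visualqa.org).
with open("v2_mscoco_val2014_annotations.json") as f:
    val_annotations = json.load(f)["annotations"]

dev_split = [a for a in val_annotations if a["question_id"] in dev_ids]
test_split = [a for a in val_annotations if a["question_id"] in test_ids]

print(len(dev_split), len(test_split))  # should print 17714 and 5000
```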
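For the Experiment Setup row, the following is a minimal PyTorch-style sketch of the quoted optimizer and loss settings. The paper does not name its framework, so torch, the scheduler classes, and the model-name dispatch (soft_count, updown, irlc) are assumptions for illustration, not the authors' implementation.

```python
# Sketch of the quoted training configuration; framework choice (PyTorch)
# and helper structure are assumptions, only the numbers come from the paper.
import torch

def make_optimizer(model, model_name):
    if model_name in ("soft_count", "updown"):
        # Learning rate 3e-4, decayed by 0.8 when training accuracy plateaus.
        opt = torch.optim.Adam(model.parameters(), lr=3e-4)
        sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
            opt, mode="max", factor=0.8)  # step with training accuracy
    elif model_name == "irlc":
        # Learning rate 5e-4, decayed by 0.99999 every iteration.
        opt = torch.optim.Adam(model.parameters(), lr=5e-4)
        sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.99999)
    else:
        raise ValueError(f"unknown model: {model_name}")
    return opt, sched

# Loss weighting from the quoted setup: the entropy penalty P_H and the
# interaction penalty P_I are each weighted by 0.005 relative to the
# counting loss.
def total_loss(counting_loss, p_h, p_i):
    return counting_loss + 0.005 * p_h + 0.005 * p_i
```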