Bilinear Attention Networks

Authors: Jin-Hwa Kim, Jaehyun Jun, Byoung-Tak Zhang

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We quantitatively and qualitatively evaluate our model on visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves new state-of-the-arts on both datasets.
Researcher Affiliation | Collaboration | SK T-Brain, Seoul National University, Surromind Robotics
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to its own source code. The only GitHub link provided (https://github.com/yuzcccc/vqa-mfb) is for a comparative model (MFH), not the authors' BAN implementation.
Open Datasets | Yes | Visual Question Answering (VQA). We evaluate on the VQA 2.0 dataset [1, 8]... Flickr30k Entities. For the evaluation of visual grounding by the bilinear attention maps, we use Flickr30k Entities [23], consisting of 31,783 images [38] and 244,035 annotations...
Dataset Splits | No | The paper mentions using 'train and validation splits' and notes that the 'test set is split into test-dev, test-standard, test-challenge, and test-reserve,' but it gives no percentages, sample counts, or details of how these splits were constructed or accessed beyond referencing the VQA 2.0 dataset itself.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions various software components and techniques (e.g., GloVe, GRU, the Adamax optimizer, Weight Normalization, Dropout, ReLU, Faster R-CNN) but does not give version numbers for these or for any other dependency, such as the deep learning framework or the Python version used.
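For reference, a hedged PyTorch sketch of the GloVe-plus-GRU question encoder the paper names, using the question-embedding size N = 1,024 from the setup row below. The vocabulary size, sequence length, 300-d embedding width, and random embedding weights are illustrative assumptions; a faithful reproduction would load pretrained GloVe vectors.

```python
import torch
import torch.nn as nn

# Hedged sketch of the unversioned components listed in the paper: a GloVe-style
# word embedding feeding a GRU question encoder. VOCAB, the 14-token length, and
# the 300-d width are illustrative, not taken from the paper.
VOCAB, GLOVE_DIM, N = 20000, 300, 1024

embedding = nn.Embedding(VOCAB, GLOVE_DIM)   # would be initialized from GloVe in practice
gru = nn.GRU(GLOVE_DIM, N, batch_first=True)

tokens = torch.randint(0, VOCAB, (512, 14))  # synthetic batch of 512 questions
_, question_vec = gru(embedding(tokens))     # final hidden state: (1, 512, 1024)
```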
Experiment Setup | Yes | The sizes of the image features and question embeddings are M = 2,048 and N = 1,024, respectively. The size of the joint representation C is the same as the rank K in low-rank bilinear pooling, C = K = 1,024, but K' = 3K is used in the bilinear attention maps to increase the representational capacity for residual learning of attention. Every linear mapping is regularized by Weight Normalization [27] and Dropout [28] (p = .2, except for the classifier, which uses .5). The Adamax optimizer [16], a variant of Adam based on the infinity norm, is used. The learning rate is min(i * 1e-3, 4e-3), where i is the epoch number starting from 1; after 10 epochs, the learning rate is decayed by 1/4 every 2 epochs up to 13 epochs (i.e., 1e-3 at the 11th epoch and 2.5e-4 at the 13th). We clip the 2-norm of the vectorized gradients to .25. The batch size is 512.
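Since the authors release no code, the following is a minimal PyTorch sketch of the training setup quoted above. The Adamax optimizer, the warm-up/decay learning-rate schedule, the weight-normalized linear maps with dropout (.2, classifier .5), gradient clipping at .25, and the batch size of 512 follow the paper; the placeholder network, the answer-vocabulary size, the loss function, and the synthetic batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

def fc(in_dim, out_dim, p=0.2):
    # Weight-normalized linear map preceded by dropout, matching the paper's
    # regularization (p = .2 everywhere except the classifier, which uses .5).
    return nn.Sequential(nn.Dropout(p), nn.utils.weight_norm(nn.Linear(in_dim, out_dim)))

M, N = 2048, 1024          # image-feature and question-embedding sizes from the paper
C = K = 1024               # joint-representation size = low-rank pooling rank
K_prime = 3 * K            # enlarged rank used in the bilinear attention maps
NUM_ANSWERS = 3129         # illustrative VQA answer-vocabulary size, not stated here

# Placeholder network: the authors' BAN implementation is not public, so this
# stands in for any model trained with the stated hyperparameters.
model = nn.Sequential(fc(M, K), nn.ReLU(), fc(K, NUM_ANSWERS, p=0.5))
optimizer = torch.optim.Adamax(model.parameters())
criterion = nn.BCEWithLogitsLoss()  # assumed loss; the paper's row above does not name one

def lr_at_epoch(i):
    # min(i * 1e-3, 4e-3) warm-up; after epoch 10, decay by 1/4 every 2 epochs.
    if i <= 10:
        return min(i * 1e-3, 4e-3)
    return 4e-3 * 0.25 ** ((i - 9) // 2)  # 1e-3 at epochs 11-12, 2.5e-4 at epoch 13

features = torch.randn(512, M)          # one synthetic batch of size 512
targets = torch.rand(512, NUM_ANSWERS)  # soft answer scores in [0, 1)

for epoch in range(1, 14):
    for group in optimizer.param_groups:
        group["lr"] = lr_at_epoch(epoch)
    optimizer.zero_grad()
    loss = criterion(model(features), targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.25)
    optimizer.step()
```

The schedule function reproduces the values the setup states explicitly: a 4e-3 plateau after the warm-up, 1e-3 at the 11th epoch, and 2.5e-4 at the 13th.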