Bilinear Attention Networks

Authors: Jin-Hwa Kim, Jaehyun Jun, Byoung-Tak Zhang

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We quantitatively and qualitatively evaluate our model on visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves new state-of-the-arts on both datasets.
Researcher Affiliation | Collaboration | SK T-Brain, Seoul National University, Surromind Robotics
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to its own source code. The only GitHub link provided (https://github.com/yuzcccc/vqa-mfb) is for a comparative model (MFH), not the authors' BAN implementation.
Open Datasets | Yes | Visual Question Answering (VQA). We evaluate on the VQA 2.0 dataset [1, 8]... Flickr30k Entities. For the evaluation of visual grounding by the bilinear attention maps, we use Flickr30k Entities [23], consisting of 31,783 images [38] and 244,035 annotations...
Dataset Splits | No | The paper mentions using 'train and validation splits' and notes that the 'test set is split into test-dev, test-standard, test-challenge, and test-reserve,' but it gives no percentages, sample counts, or details of how these splits were constructed or accessed beyond referencing the VQA 2.0 dataset itself.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions various software components and techniques (e.g., GloVe, GRU, the Adamax optimizer, Weight Normalization, Dropout, ReLU, Faster R-CNN) but does not give version numbers for these or for any other dependency, such as the deep learning framework or the Python version used.
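For reference, a hedged PyTorch sketch of the GloVe-plus-GRU question encoder the paper names, using the question-embedding size N = 1,024 from the setup row below. The vocabulary size, sequence length, 300-d embedding width, and random embedding weights are illustrative assumptions; a faithful reproduction would load pretrained GloVe vectors.

```python
import torch
import torch.nn as nn

# Hedged sketch of the unversioned components listed in the paper: a GloVe-style
# word embedding feeding a GRU question encoder. VOCAB, the 14-token length, and
# the 300-d width are illustrative, not taken from the paper.
VOCAB, GLOVE_DIM, N = 20000, 300, 1024

embedding = nn.Embedding(VOCAB, GLOVE_DIM)   # would be initialized from GloVe in practice
gru = nn.GRU(GLOVE_DIM, N, batch_first=True)

tokens = torch.randint(0, VOCAB, (512, 14))  # synthetic batch of 512 questions
_, question_vec = gru(embedding(tokens))     # final hidden state: (1, 512, 1024)
```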
Experiment Setup | Yes | The sizes of the image features and question embeddings are M = 2,048 and N = 1,024, respectively. The size of the joint representation C is the same as the rank K in low-rank bilinear pooling, C = K = 1,024, but K' = 3K is used in the bilinear attention maps to increase the representational capacity for residual learning of attention. Every linear mapping is regularized by Weight Normalization [27] and Dropout [28] (p = .2, except for the classifier, which uses .5). The Adamax optimizer [16], a variant of Adam based on the infinity norm, is used. The learning rate is min(i * 1e-3, 4e-3), where i is the epoch number starting from 1; after 10 epochs, the learning rate is decayed by 1/4 every 2 epochs up to 13 epochs (i.e., 1e-3 at the 11th epoch and 2.5e-4 at the 13th). We clip the 2-norm of the vectorized gradients to .25. The batch size is 512.
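Since the authors release no code, the following is a minimal PyTorch sketch of the training setup quoted above. The Adamax optimizer, the warm-up/decay learning-rate schedule, the weight-normalized linear maps with dropout (.2, classifier .5), gradient clipping at .25, and the batch size of 512 follow the paper; the placeholder network, the answer-vocabulary size, the loss function, and the synthetic batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

def fc(in_dim, out_dim, p=0.2):
    # Weight-normalized linear map preceded by dropout, matching the paper's
    # regularization (p = .2 everywhere except the classifier, which uses .5).
    return nn.Sequential(nn.Dropout(p), nn.utils.weight_norm(nn.Linear(in_dim, out_dim)))

M, N = 2048, 1024          # image-feature and question-embedding sizes from the paper
C = K = 1024               # joint-representation size = low-rank pooling rank
K_prime = 3 * K            # enlarged rank used in the bilinear attention maps
NUM_ANSWERS = 3129         # illustrative VQA answer-vocabulary size, not stated here

# Placeholder network: the authors' BAN implementation is not public, so this
# stands in for any model trained with the stated hyperparameters.
model = nn.Sequential(fc(M, K), nn.ReLU(), fc(K, NUM_ANSWERS, p=0.5))
optimizer = torch.optim.Adamax(model.parameters())
criterion = nn.BCEWithLogitsLoss()  # assumed loss; the paper's row above does not name one

def lr_at_epoch(i):
    # min(i * 1e-3, 4e-3) warm-up; after epoch 10, decay by 1/4 every 2 epochs.
    if i <= 10:
        return min(i * 1e-3, 4e-3)
    return 4e-3 * 0.25 ** ((i - 9) // 2)  # 1e-3 at epochs 11-12, 2.5e-4 at epoch 13

features = torch.randn(512, M)          # one synthetic batch of size 512
targets = torch.rand(512, NUM_ANSWERS)  # soft answer scores in [0, 1)

for epoch in range(1, 14):
    for group in optimizer.param_groups:
        group["lr"] = lr_at_epoch(epoch)
    optimizer.zero_grad()
    loss = criterion(model(features), targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.25)
    optimizer.step()
```

The schedule function reproduces the values the setup states explicitly: a 4e-3 plateau after the warm-up, 1e-3 at the 11th epoch, and 2.5e-4 at the 13th.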