Bilinear Attention Networks
Authors: Jin-Hwa Kim, Jaehyun Jun, Byoung-Tak Zhang
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We quantitatively and qualitatively evaluate our model on visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves new state-of-the-arts on both datasets. |
| Researcher Affiliation | Collaboration | 1SK T-Brain, 2Seoul National University, 3Surromind Robotics |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to its own source code. The only GitHub link provided (https://github.com/yuzcccc/vqa-mfb) is for a comparative model (MFH), not the authors' BAN implementation. |
| Open Datasets | Yes | Visual Question Answering (VQA). We evaluate on the VQA 2.0 dataset [1, 8]... Flickr30k Entities. For the evaluation of visual grounding by the bilinear attention maps, we use Flickr30k Entities [23] consisting of 31,783 images [38] and 244,035 annotations... |
| Dataset Splits | No | The paper mentions using 'train and validation splits' and that the 'test set is split into test-dev, test-standard, test-challenge, and test-reserve,' but it does not provide specific percentages, sample counts, or explicit details for how these splits were performed or accessed, beyond referencing the VQA 2.0 dataset itself. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions various software components and techniques (e.g., GloVe, GRU, Adamax optimizer, Weight Normalization, Dropout, ReLU, Faster R-CNN) but does not provide specific version numbers for these or any other software dependencies, such as the deep learning framework or Python version used. |
| Experiment Setup | Yes | The sizes of the image features and question embeddings are M = 2,048 and N = 1,024, respectively. The size of the joint representation C is the same as the rank K in low-rank bilinear pooling, C = K = 1,024, but K′ = K × 3 is used in the bilinear attention maps to increase the representational capacity for residual learning of attention. Every linear mapping is regularized by Weight Normalization [27] and Dropout [28] (p = .2, except for the classifier with .5). The Adamax optimizer [16], a variant of Adam based on the infinity norm, is used. The learning rate is min(i × 10⁻³, 4 × 10⁻³) where i is the number of epochs starting from 1; after 10 epochs, the learning rate is decayed by 1/4 every 2 epochs up to 13 epochs (i.e., 1 × 10⁻³ at the 11th epoch and 2.5 × 10⁻⁴ at the 13th). We clip the 2-norm of vectorized gradients to .25. The batch size is 512. (A hedged sketch of this schedule follows the table.) |
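
Since the paper does not release code or name its framework, the following is a minimal PyTorch sketch of the training setup quoted above, not the authors' implementation. The `epoch_lr` function, the `nn.Sequential` stand-in model, and the dummy batch are assumptions for illustration; only the hyperparameter values (learning-rate schedule, Adamax, gradient clip of .25, batch size 512, dropout p = .2, Weight Normalization, M = 2,048, C = K = 1,024) come from the paper.

```python
import torch
import torch.nn as nn

# Hyperparameter values quoted from the paper's experiment setup.
BASE_LR = 1e-3    # per-epoch warmup increment
MAX_LR = 4e-3     # ceiling reached at epoch 4, held through epoch 10
GRAD_CLIP = 0.25  # 2-norm clip on vectorized gradients
EPOCHS = 13

def epoch_lr(i: int) -> float:
    """min(i * 1e-3, 4e-3) warmup for 1-indexed epoch i, then a 1/4 decay
    every 2 epochs after epoch 10 (1e-3 at epoch 11, 2.5e-4 at epoch 13)."""
    if i <= 10:
        return min(i * BASE_LR, MAX_LR)
    return MAX_LR * 0.25 ** ((i - 9) // 2)

# Stand-in model (hypothetical): the real BAN is not released. A weight-
# normalized linear layer with dropout p=0.2 only illustrates the
# regularization described for every linear mapping in the paper.
model = nn.Sequential(
    nn.utils.weight_norm(nn.Linear(2048, 1024)),  # M = 2,048 -> C = K = 1,024
    nn.ReLU(),
    nn.Dropout(p=0.2),
)
optimizer = torch.optim.Adamax(model.parameters(), lr=BASE_LR)

for epoch in range(1, EPOCHS + 1):
    for group in optimizer.param_groups:
        group["lr"] = epoch_lr(epoch)  # set the per-epoch learning rate
    x, y = torch.randn(512, 2048), torch.randn(512, 1024)  # dummy batch of 512
    loss = nn.functional.mse_loss(model(x), y)  # placeholder objective
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
    optimizer.step()
```

Under this reading of the schedule, the rate warms up linearly (1e-3, 2e-3, 3e-3, 4e-3 over epochs 1 to 4), holds at 4e-3 through epoch 10, then quarters at epochs 11 and 13, matching the 1e-3 and 2.5e-4 values the paper states for those epochs.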