Bayesian Attention Modules

Authors: Xinjie Fan, Shujian Zhang, Bo Chen, Mingyuan Zhou

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments show the proposed method brings consistent improvements over the corresponding baselines. We evaluate the proposed stochastic attention module on a broad range of tasks, including graph node classification, visual question answering, image captioning, machine translation, and language understanding, where attention plays an important role."
Researcher Affiliation | Academia | "The University of Texas at Austin and Xidian University; xfan@utexas.edu, szhang19@utexas.edu, bchen@mail.xidian.edu.cn, mingyuan.zhou@mccombs.utexas.edu"
Pseudocode | Yes | "This way provides unbiased and low-variance gradient estimates (see the pseudo code in Algorithm 1 in Appendix)." (A hedged sketch of such a reparameterized estimator appears below the table.)
Open Source Code | Yes | "Python code is available at https://github.com/zhougroup/BAM"
Open Datasets | Yes | "We experiment with three benchmark graphs, including Cora, Citeseer, and Pubmed, for node classification ... on the VQA-v2 dataset [43], consisting of human-annotated question-answer pairs for images from the MS-COCO dataset [44] ... We conduct our experiments on MS-COCO [44] ... (GLUE) [59] and two versions of Stanford Question Answering Datasets (SQuAD) [60, 61]."
Dataset Splits | Yes | "In GAT [7], for each dataset, 20 nodes per class are used for training, 500 for validation, and 1000 for testing, with the rest used as unlabeled training data." (These are the standard public Planetoid splits; see the second sketch below the table.)
Hardware Specification | Yes | "All experiments are conducted on a single Nvidia Tesla V100 GPU with 16 GB memory."
Software Dependencies | No | The paper mentions software such as the Huggingface PyTorch Transformer and optimizers such as Adam, but does not specify their version numbers.
Experiment Setup | Yes | "We use a 2-layer GAT model with 8 attention heads, with a dropout of 0.6 applied to both the input and GAT layers, and an ELU activation function [67] (with α = 1.0) applied to each layer. The learning rate is set to 0.005 with Adam optimizer [69] and 500 epochs with early stopping after 100 epochs." (This configuration is mirrored in the second sketch below the table.)
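
The Pseudocode row refers to unbiased, low-variance gradient estimates for the sampled attention weights. Below is a minimal sketch of that idea, assuming a Lognormal reparameterization whose location is tied to the softmax logits and whose scale is a fixed hyperparameter; the function name `stochastic_attention` and this exact parameterization are illustrative assumptions, the KL regularization toward a contextual prior is omitted, and Algorithm 1 in the paper and the released code remain the authoritative description.

```python
import torch
import torch.nn.functional as F

def stochastic_attention(scores, sigma=1.0, training=True):
    """Sample normalized attention weights via a Lognormal reparameterization.

    Simplified sketch: `scores` are the usual scaled dot-product logits with
    shape (batch, heads, queries, keys). The Lognormal location is tied to the
    log-softmax of the logits; `sigma` is a fixed scale hyperparameter.
    """
    log_mean = F.log_softmax(scores, dim=-1)       # log of the softmax attention
    if training:
        eps = torch.randn_like(scores)             # eps ~ N(0, I)
        s = torch.exp(log_mean + sigma * eps)      # reparameterized Lognormal sample
    else:
        s = torch.exp(log_mean)                    # no noise at evaluation: reduces to softmax
    weights = s / s.sum(dim=-1, keepdim=True)      # renormalize across keys
    return weights

# Usage: replace softmax(QK^T / sqrt(d)) with sampled weights.
q = torch.randn(2, 8, 5, 16)   # (batch, heads, queries, dim)
k = torch.randn(2, 8, 7, 16)   # (batch, heads, keys, dim)
v = torch.randn(2, 8, 7, 16)
scores = q @ k.transpose(-2, -1) / 16 ** 0.5
attn = stochastic_attention(scores, sigma=1.0, training=True)
out = attn @ v
```

Because the sample is a deterministic transformation of standard Gaussian noise, gradients with respect to the attention logits flow through the sampled weights, which is what yields pathwise (reparameterization) gradient estimates of the kind the quoted sentence describes.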
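
The Dataset Splits and Experiment Setup rows together pin down the graph node-classification protocol: the standard 20-per-class/500/1000 Planetoid splits and a 2-layer, 8-head GAT trained with dropout 0.6, ELU (α = 1.0), Adam at a learning rate of 0.005, and 500 epochs with early stopping after 100. The sketch below reproduces that configuration using PyTorch Geometric, which is an assumption; the paper does not name this library, the hidden size of 8 features per head follows the standard GAT setup rather than the quoted text, and the Bayesian attention module itself would replace the softmax normalization inside each attention layer of this deterministic baseline.

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GATConv

# The "public" Planetoid split matches the protocol quoted above:
# 20 training nodes per class, 500 validation nodes, 1000 test nodes.
dataset = Planetoid(root='data/Cora', name='Cora', split='public')
data = dataset[0]

class GAT(torch.nn.Module):
    """2-layer GAT baseline with the hyperparameters quoted in the table."""
    def __init__(self, in_dim, num_classes, hidden=8, heads=8):
        super().__init__()
        self.conv1 = GATConv(in_dim, hidden, heads=heads, dropout=0.6)
        self.conv2 = GATConv(hidden * heads, num_classes, heads=1, dropout=0.6)

    def forward(self, x, edge_index):
        x = F.dropout(x, p=0.6, training=self.training)   # dropout on the input
        x = F.elu(self.conv1(x, edge_index), alpha=1.0)   # ELU with alpha = 1.0
        x = F.dropout(x, p=0.6, training=self.training)   # dropout between GAT layers
        return self.conv2(x, edge_index)

model = GAT(dataset.num_features, dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

best_val, patience = 0.0, 0
for epoch in range(500):                                  # 500 epochs ...
    model.train()
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        pred = model(data.x, data.edge_index).argmax(dim=-1)
        val_acc = (pred[data.val_mask] == data.y[data.val_mask]).float().mean().item()
    if val_acc > best_val:
        best_val, patience = val_acc, 0
    else:
        patience += 1
    if patience >= 100:                                   # ... with early stopping after 100
        break
```

Final evaluation on `data.test_mask` is omitted for brevity; the quoted excerpt does not specify the early-stopping criterion, so validation accuracy is used here as a placeholder choice.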