Bayesian Attention Modules
Authors: Xinjie Fan, Shujian Zhang, Bo Chen, Mingyuan Zhou
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show the proposed method brings consistent improvements over the corresponding baselines. We evaluate the proposed stochastic attention module on a broad range of tasks, including graph node classification, visual question answering, image captioning, machine translation, and language understanding, where attention plays an important role. |
| Researcher Affiliation | Academia | 1The University of Texas at Austin and 2Xidian University xfan@utexas.edu, szhang19@utexas.edu, bchen@mail.xidian.edu.cn, mingyuan.zhou@mccombs.utexas.edu |
| Pseudocode | Yes | This way provides unbiased and low-variance gradient estimates (see the pseudo code in Algorithm 1 in Appendix). |
| Open Source Code | Yes | Python code is available at https://github.com/zhougroup/BAM |
| Open Datasets | Yes | We experiment with three benchmark graphs, including Cora, Citeseer, and Pubmed, for node classification...on the VQA-v2 dataset [43], consisting of human-annotated question-answer pairs for images from the MS-COCO dataset [44]...We conduct our experiments on MS-COCO [44]...(GLUE) [59] and two versions of Stanford Question Answering Datasets (SQuAD) [60, 61]. |
| Dataset Splits | Yes | In GAT [7], for each dataset, 20 nodes per class are used for training, 500 for validation, and 1000 for testing, with the rest used as unlabeled training data. |
| Hardware Specification | Yes | All experiments are conducted on a single Nvidia Tesla V100 GPU with 16 GB memory. |
| Software Dependencies | No | The paper mentions software like 'Huggingface PyTorch Transformer' and optimizers like 'Adam' but does not specify their version numbers. |
| Experiment Setup | Yes | We use a 2-layer GAT model with 8 attention heads, with a dropout of 0.6 applied to both the input and GAT layers, and an ELU activation function [67] (with α = 1.0) applied to each layer. The learning rate is set to 0.005 with Adam optimizer [69] and 500 epochs with early stopping after 100 epochs. |
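
For context on the Pseudocode row: the unbiased, low-variance gradient estimates come from reparameterized sampling of the attention weights. The sketch below is a minimal illustration, not the authors' Algorithm 1; the choice of Gaussian noise on the pre-softmax scores (i.e., a Lognormal over the unnormalized weights, renormalized by the softmax) and the `sigma` parameter are assumptions made for illustration only.

```python
# Minimal sketch (not the authors' code): attention weights sampled via the
# reparameterization trick, so pathwise gradients flow through the sample.
import torch
import torch.nn.functional as F

def stochastic_attention(query, key, value, sigma=0.5, training=True):
    """Soft attention whose weights are sampled, then renormalized.

    query: (batch, n_q, d); key, value: (batch, n_k, d).
    """
    d = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d ** 0.5      # (batch, n_q, n_k)
    if training:
        # Reparameterized noise: exp(scores + sigma * eps) is Lognormal,
        # so the softmax below normalizes Lognormal random weights.
        eps = torch.randn_like(scores)
        scores = scores + sigma * eps
    weights = F.softmax(scores, dim=-1)                     # back onto the simplex
    return weights @ value, weights

# Usage:
# q, k, v = torch.randn(2, 5, 64), torch.randn(2, 7, 64), torch.randn(2, 7, 64)
# out, attn = stochastic_attention(q, k, v)
```

At test time (`training=False`) the module falls back to deterministic softmax attention, which is why the stochastic module can be dropped into existing attention-based architectures.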
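The Experiment Setup and Dataset Splits rows translate directly into a short training script. The sketch below is a hedged reconstruction of the deterministic GAT baseline using PyTorch Geometric (an assumption; the authors' code at https://github.com/zhougroup/BAM may differ), with the stated hyperparameters: 2 GAT layers, 8 attention heads, dropout 0.6 on the input and GAT layers, ELU, Adam with learning rate 0.005, 500 epochs, and early stopping after 100 epochs. The single-head output layer follows the standard GAT configuration for Cora and is an assumption here; the Planetoid loader's public split matches the 20-per-class / 500 / 1000 split described above.

```python
# Hedged sketch of the GAT baseline setup described in the table (assumes
# PyTorch Geometric; not the authors' implementation).
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GATConv

# Standard public split: 20 nodes per class for training, 500 validation, 1000 test.
dataset = Planetoid(root="data", name="Cora")
data = dataset[0]

class GAT(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GATConv(dataset.num_features, 8, heads=8, dropout=0.6)
        self.conv2 = GATConv(8 * 8, dataset.num_classes, heads=1, concat=False, dropout=0.6)

    def forward(self, x, edge_index):
        x = F.dropout(x, p=0.6, training=self.training)   # dropout on the input
        x = F.elu(self.conv1(x, edge_index))               # ELU (alpha = 1.0) after layer 1
        x = F.dropout(x, p=0.6, training=self.training)    # dropout on the GAT layer
        return self.conv2(x, edge_index)

model = GAT()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

best_val, patience = 0.0, 0
for epoch in range(500):
    model.train()
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    F.cross_entropy(out[data.train_mask], data.y[data.train_mask]).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        pred = model(data.x, data.edge_index).argmax(dim=-1)
    val_acc = (pred[data.val_mask] == data.y[data.val_mask]).float().mean().item()
    if val_acc > best_val:
        best_val, patience = val_acc, 0
    else:
        patience += 1
        if patience >= 100:   # early stopping after 100 epochs without improvement
            break
```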