Bayesian Attention Modules
Authors: Xinjie Fan, Shujian Zhang, Bo Chen, Mingyuan Zhou
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show the proposed method brings consistent improvements over the corresponding baselines. We evaluate the proposed stochastic attention module on a broad range of tasks, including graph node classification, visual question answering, image captioning, machine translation, and language understanding, where attention plays an important role. |
| Researcher Affiliation | Academia | 1The University of Texas at Austin and 2Xidian University xfan@utexas.edu, szhang19@utexas.edu, bchen@mail.xidian.edu.cn, mingyuan.zhou@mccombs.utexas.edu |
| Pseudocode | Yes | This way provides unbiased and low-variance gradient estimates (see the pseudo code in Algorithm 1 in Appendix). |
| Open Source Code | Yes | Python code is available at https://github.com/zhougroup/BAM |
| Open Datasets | Yes | We experiment with three benchmark graphs, including Cora, Citeseer, and Pubmed, for node classification...on the VQA-v2 dataset [43], consisting of human-annotated question-answer pairs for images from the MS-COCO dataset [44]...We conduct our experiments on MS-COCO [44]...(GLUE) [59] and two versions of Stanford Question Answering Datasets (SQuAD) [60, 61]. |
| Dataset Splits | Yes | In GAT [7], for each dataset, 20 nodes per class are used for training, 500 for validation, and 1000 for testing, with the rest used as unlabeled training data. |
| Hardware Specification | Yes | All experiments are conducted on a single Nvidia Tesla V100 GPU with 16 GB memory. |
| Software Dependencies | No | The paper mentions software like 'Huggingface PyTorch Transformer' and optimizers like 'Adam' but does not specify their version numbers. |
| Experiment Setup | Yes | We use a 2-layer GAT model with 8 attention heads, with a dropout of 0.6 applied to both the input and GAT layers, and an ELU activation function [67] (with α = 1.0) applied to each layer. The learning rate is set to 0.005 with Adam optimizer [69] and 500 epochs with early stopping after 100 epochs. |
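
For context on the Pseudocode row: the unbiased, low-variance gradient estimates come from reparameterized sampling of the attention weights. The sketch below is a minimal illustration, not the authors' Algorithm 1; the choice of Gaussian noise on the pre-softmax scores (i.e., a Lognormal over the unnormalized weights, renormalized by the softmax) and the `sigma` parameter are assumptions made for illustration only.

```python
# Minimal sketch (not the authors' code): attention weights sampled via the
# reparameterization trick, so pathwise gradients flow through the sample.
import torch
import torch.nn.functional as F

def stochastic_attention(query, key, value, sigma=0.5, training=True):
    """Soft attention whose weights are sampled, then renormalized.

    query: (batch, n_q, d); key, value: (batch, n_k, d).
    """
    d = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d ** 0.5      # (batch, n_q, n_k)
    if training:
        # Reparameterized noise: exp(scores + sigma * eps) is Lognormal,
        # so the softmax below normalizes Lognormal random weights.
        eps = torch.randn_like(scores)
        scores = scores + sigma * eps
    weights = F.softmax(scores, dim=-1)                     # back onto the simplex
    return weights @ value, weights

# Usage:
# q, k, v = torch.randn(2, 5, 64), torch.randn(2, 7, 64), torch.randn(2, 7, 64)
# out, attn = stochastic_attention(q, k, v)
```

At test time (`training=False`) the module falls back to deterministic softmax attention, which is why the stochastic module can be dropped into existing attention-based architectures.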
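The Experiment Setup and Dataset Splits rows translate directly into a short training script. The sketch below is a hedged reconstruction of the deterministic GAT baseline using PyTorch Geometric (an assumption; the authors' code at https://github.com/zhougroup/BAM may differ), with the stated hyperparameters: 2 GAT layers, 8 attention heads, dropout 0.6 on the input and GAT layers, ELU, Adam with learning rate 0.005, 500 epochs, and early stopping after 100 epochs. The single-head output layer follows the standard GAT configuration for Cora and is an assumption here; the Planetoid loader's public split matches the 20-per-class / 500 / 1000 split described above.

```python
# Hedged sketch of the GAT baseline setup described in the table (assumes
# PyTorch Geometric; not the authors' implementation).
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GATConv

# Standard public split: 20 nodes per class for training, 500 validation, 1000 test.
dataset = Planetoid(root="data", name="Cora")
data = dataset[0]

class GAT(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GATConv(dataset.num_features, 8, heads=8, dropout=0.6)
        self.conv2 = GATConv(8 * 8, dataset.num_classes, heads=1, concat=False, dropout=0.6)

    def forward(self, x, edge_index):
        x = F.dropout(x, p=0.6, training=self.training)   # dropout on the input
        x = F.elu(self.conv1(x, edge_index))               # ELU (alpha = 1.0) after layer 1
        x = F.dropout(x, p=0.6, training=self.training)    # dropout on the GAT layer
        return self.conv2(x, edge_index)

model = GAT()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

best_val, patience = 0.0, 0
for epoch in range(500):
    model.train()
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    F.cross_entropy(out[data.train_mask], data.y[data.train_mask]).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        pred = model(data.x, data.edge_index).argmax(dim=-1)
    val_acc = (pred[data.val_mask] == data.y[data.val_mask]).float().mean().item()
    if val_acc > best_val:
        best_val, patience = val_acc, 0
    else:
        patience += 1
        if patience >= 100:   # early stopping after 100 epochs without improvement
            break
```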