Fast Monte-Carlo Approximation of the Attention Mechanism

Authors: Hyunjun Kim, JeongGil Ko (pp. 7185-7193)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study the theoretical error bounds and demonstrate that MCA reduces attention complexity (in FLOPS) for various Transformer models by up to 11× on GLUE benchmarks without compromising model accuracy. We implement MCA via custom CUDA kernels, and measure its computational complexity and accuracy as performance metrics. We replace the multi-head attention (Vaswani et al. 2017) in BERT with MCA and test the performance with the GLUE benchmark (Wang et al. 2018). (A generic Monte-Carlo sketch of the idea follows the table.)
Researcher Affiliation | Academia | Hyunjun Kim and Jeong Gil Ko, School of Integrated Technology, Yonsei University; hyunjun.kim@yonsei.ac.kr, jeonggil.ko@yonsei.ac.kr
Pseudocode | No | The paper describes algorithms using mathematical equations and textual explanations but does not include structured pseudocode or algorithm blocks with formal labels.
Open Source Code | Yes | Source code and appendix: https://github.com/eis-lab/monte-carlo-attention
Open Datasets | Yes | We test the performance with the GLUE benchmark (Wang et al. 2018). To evaluate MCA with the Longformer model, we use the arXiv Academic Paper dataset (AAPD) (Yang et al. 2016), the IMDB review classification dataset, and the Hyperpartisan News Detection (HND) dataset (Kiesel et al. 2019).
Dataset Splits | No | The paper mentions using standard benchmarks like GLUE but does not explicitly provide specific train/validation/test splits (e.g., percentages, sample counts, or citations to a specific split methodology) for reproducibility.
Hardware Specification | Yes | We employ AWS p3.2xlarge instances for the experiment, each of which serves one NVIDIA V100 GPU.
Software Dependencies | No | MCA is implemented as a C++ PyTorch extension using custom CUDA kernels... and we use the Huggingface library for implementations and to download pre-trained weights for BERT, DistilBERT, and Longformer. The paper does not provide specific version numbers for PyTorch, CUDA, or Huggingface. (A hedged sketch of these two steps follows the table.)
Experiment Setup | Yes | Given that α is an adjustable parameter that determines the attention error bound tightness, we evaluate for α = 0.2, 0.4, 0.6, and 1.0. We configure the Longformer with a window size of 256 and apply global attention to the CLS token. (A hedged Longformer configuration sketch follows the table.)
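
The Research Type row quotes the paper's claim that attention is approximated with Monte-Carlo sampling. The snippet below is only a generic PyTorch illustration of that idea: it estimates the attention-value product by sampling value rows from each query's attention distribution instead of computing the full weighted sum. It is not the authors' MCA kernel (that code is in the repository linked above, together with derived error bounds), and it still materializes the full score matrix, so it does not reproduce the reported FLOPS savings.

```python
import torch

def monte_carlo_attention(q, k, v, num_samples=64):
    """Generic Monte-Carlo estimate of softmax(q k^T / sqrt(d)) @ v.

    Illustrative sketch only: each query averages `num_samples` value rows
    drawn i.i.d. from its attention distribution, which is an unbiased
    estimate of the exact attention output.
    """
    d = q.size(-1)
    probs = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # (n_q, n_k)
    idx = torch.multinomial(probs, num_samples, replacement=True)      # (n_q, s)
    return v[idx].mean(dim=1)                                          # (n_q, d_v)

# Quick sanity check against exact attention.
q, k, v = torch.randn(8, 32), torch.randn(16, 32), torch.randn(16, 32)
exact = torch.softmax(q @ k.T / 32 ** 0.5, dim=-1) @ v
approx = monte_carlo_attention(q, k, v, num_samples=4096)
print((exact - approx).abs().max())
```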
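
The Software Dependencies row notes a C++/CUDA PyTorch extension and Huggingface-hosted pre-trained weights without version pins. Below is a minimal sketch of those two steps; the source file names and checkpoint identifiers are assumptions for illustration, not taken from the authors' repository, and no library versions are implied.

```python
from torch.utils.cpp_extension import load
from transformers import AutoModelForSequenceClassification

# JIT-compile a C++/CUDA PyTorch extension at import time.
# "mca.cpp" and "mca_kernel.cu" are hypothetical file names used for illustration.
mca = load(name="monte_carlo_attention",
           sources=["mca.cpp", "mca_kernel.cu"],
           verbose=True)

# Download pre-trained weights from the Huggingface hub.
# Checkpoint names are assumptions; the paper only names the model families.
checkpoints = ["bert-base-uncased",
               "distilbert-base-uncased",
               "allenai/longformer-base-4096"]
models = {c: AutoModelForSequenceClassification.from_pretrained(c) for c in checkpoints}
```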
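
The Experiment Setup row specifies a Longformer window size of 256 with global attention on the CLS token; α only controls the MCA error bound and has no counterpart in the stock library. A hedged sketch of that configuration with the Huggingface transformers API follows, assuming the public allenai/longformer-base-4096 checkpoint (the paper does not name one).

```python
import torch
from transformers import (LongformerConfig, LongformerForSequenceClassification,
                          LongformerTokenizerFast)

ckpt = "allenai/longformer-base-4096"  # assumed checkpoint
config = LongformerConfig.from_pretrained(ckpt, attention_window=256)  # window size 256
model = LongformerForSequenceClassification.from_pretrained(ckpt, config=config)
tokenizer = LongformerTokenizerFast.from_pretrained(ckpt)

inputs = tokenizer("An example document.", return_tensors="pt")
# Global attention on the CLS token only (position 0), as described in the setup.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1
outputs = model(**inputs, global_attention_mask=global_attention_mask)
```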