Fast Monte-Carlo Approximation of the Attention Mechanism

Authors: Hyunjun Kim, JeongGil Ko (pp. 7185-7193)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study the theoretical error bounds and demonstrate that MCA reduces attention complexity (in FLOPS) for various Transformer models by up to 11× on GLUE benchmarks without compromising model accuracy. We implement MCA via custom CUDA kernels, and measure its computational complexity and accuracy as performance metrics. We replace the multi-head attention (Vaswani et al. 2017) in BERT with MCA and test the performance with the GLUE benchmark (Wang et al. 2018). (A generic Monte-Carlo sketch of the idea follows the table.)
Researcher Affiliation | Academia | Hyunjun Kim and Jeong Gil Ko, School of Integrated Technology, Yonsei University; hyunjun.kim@yonsei.ac.kr, jeonggil.ko@yonsei.ac.kr
Pseudocode | No | The paper describes algorithms using mathematical equations and textual explanations but does not include structured pseudocode or algorithm blocks with formal labels.
Open Source Code | Yes | Source code and appendix: https://github.com/eis-lab/monte-carlo-attention
Open Datasets | Yes | We test the performance with the GLUE benchmark (Wang et al. 2018). To evaluate MCA with the Longformer model, we use the arXiv Academic Paper dataset (AAPD) (Yang et al. 2016), the IMDB review classification dataset, and the Hyperpartisan News Detection (HND) dataset (Kiesel et al. 2019).
Dataset Splits | No | The paper mentions using standard benchmarks like GLUE but does not explicitly provide specific train/validation/test splits (e.g., percentages, sample counts, or citations to a specific split methodology) for reproducibility.
Hardware Specification | Yes | We employ AWS p3.2xlarge instances for the experiment, each of which serves one NVIDIA V100 GPU.
Software Dependencies | No | MCA is implemented as a C++ PyTorch extension using custom CUDA kernels... and we use the Huggingface library for implementations and to download pre-trained weights for BERT, DistilBERT, and Longformer. The paper does not provide specific version numbers for PyTorch, CUDA, or Huggingface. (A hedged sketch of these two steps follows the table.)
Experiment Setup | Yes | Given that α is an adjustable parameter that determines the attention error bound tightness, we evaluate for α = 0.2, 0.4, 0.6, and 1.0. We configure the Longformer with a window size of 256 and apply global attention to the CLS token. (A hedged Longformer configuration sketch follows the table.)
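
The Research Type row quotes the paper's claim that attention is approximated with Monte-Carlo sampling. The snippet below is only a generic PyTorch illustration of that idea: it estimates the attention-value product by sampling value rows from each query's attention distribution instead of computing the full weighted sum. It is not the authors' MCA kernel (that code is in the repository linked above, together with derived error bounds), and it still materializes the full score matrix, so it does not reproduce the reported FLOPS savings.

```python
import torch

def monte_carlo_attention(q, k, v, num_samples=64):
    """Generic Monte-Carlo estimate of softmax(q k^T / sqrt(d)) @ v.

    Illustrative sketch only: each query averages `num_samples` value rows
    drawn i.i.d. from its attention distribution, which is an unbiased
    estimate of the exact attention output.
    """
    d = q.size(-1)
    probs = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # (n_q, n_k)
    idx = torch.multinomial(probs, num_samples, replacement=True)      # (n_q, s)
    return v[idx].mean(dim=1)                                          # (n_q, d_v)

# Quick sanity check against exact attention.
q, k, v = torch.randn(8, 32), torch.randn(16, 32), torch.randn(16, 32)
exact = torch.softmax(q @ k.T / 32 ** 0.5, dim=-1) @ v
approx = monte_carlo_attention(q, k, v, num_samples=4096)
print((exact - approx).abs().max())
```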
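
The Software Dependencies row notes a C++/CUDA PyTorch extension and Huggingface-hosted pre-trained weights without version pins. Below is a minimal sketch of those two steps; the source file names and checkpoint identifiers are assumptions for illustration, not taken from the authors' repository, and no library versions are implied.

```python
from torch.utils.cpp_extension import load
from transformers import AutoModelForSequenceClassification

# JIT-compile a C++/CUDA PyTorch extension at import time.
# "mca.cpp" and "mca_kernel.cu" are hypothetical file names used for illustration.
mca = load(name="monte_carlo_attention",
           sources=["mca.cpp", "mca_kernel.cu"],
           verbose=True)

# Download pre-trained weights from the Huggingface hub.
# Checkpoint names are assumptions; the paper only names the model families.
checkpoints = ["bert-base-uncased",
               "distilbert-base-uncased",
               "allenai/longformer-base-4096"]
models = {c: AutoModelForSequenceClassification.from_pretrained(c) for c in checkpoints}
```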
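
The Experiment Setup row specifies a Longformer window size of 256 with global attention on the CLS token; α only controls the MCA error bound and has no counterpart in the stock library. A hedged sketch of that configuration with the Huggingface transformers API follows, assuming the public allenai/longformer-base-4096 checkpoint (the paper does not name one).

```python
import torch
from transformers import (LongformerConfig, LongformerForSequenceClassification,
                          LongformerTokenizerFast)

ckpt = "allenai/longformer-base-4096"  # assumed checkpoint
config = LongformerConfig.from_pretrained(ckpt, attention_window=256)  # window size 256
model = LongformerForSequenceClassification.from_pretrained(ckpt, config=config)
tokenizer = LongformerTokenizerFast.from_pretrained(ckpt)

inputs = tokenizer("An example document.", return_tensors="pt")
# Global attention on the CLS token only (position 0), as described in the setup.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1
outputs = model(**inputs, global_attention_mask=global_attention_mask)
```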