Fast Monte-Carlo Approximation of the Attention Mechanism
Authors: Hyunjun Kim, Jeong Gil Ko
AAAI 2022, pp. 7185-7193 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the theoretical error bounds and demonstrate that MCA reduces attention complexity (in FLOPS) for various Transformer models by up to 11× in GLUE benchmarks without compromising model accuracy. We implement MCA via custom CUDA kernels, and measure its computational complexity and accuracy as performance metrics. We replace the multi-head attention in BERT (Vaswani et al. 2017) with MCA, and test the performance with the GLUE benchmark (Wang et al. 2018). (A hedged sketch of this model-swap pattern follows the table.) |
| Researcher Affiliation | Academia | Hyunjun Kim and Jeong Gil Ko, School of Integrated Technology, Yonsei University; hyunjun.kim@yonsei.ac.kr, jeonggil.ko@yonsei.ac.kr |
| Pseudocode | No | The paper describes algorithms using mathematical equations and textual explanations but does not include structured pseudocode or algorithm blocks with formal labels. |
| Open Source Code | Yes | Source code and appendix: https://github.com/eis-lab/monte-carlo-attention |
| Open Datasets | Yes | We test the performance with the GLUE benchmark (Wang et al. 2018). To evaluate MCA with the Longformer model, we use the arXiv Academic Paper dataset (AAPD) (Yang et al. 2016), the IMDB review classification dataset, and the Hyperpartisan News Detection (HND) dataset (Kiesel et al. 2019). |
| Dataset Splits | No | The paper mentions using standard benchmarks like GLUE but does not explicitly provide specific train/validation/test splits (e.g., percentages, sample counts, or citations to a specific split methodology) for reproducibility. |
| Hardware Specification | Yes | We employ AWS p3.2xlarge instances for the experiment, each of which provides one NVIDIA V100 GPU. |
| Software Dependencies | No | MCA is implemented as a C++ PyTorch extension using custom CUDA kernels... and uses the Huggingface library for implementations and to download pre-trained weights for BERT, DistilBERT, and Longformer. The paper does not provide specific version numbers for PyTorch, CUDA, or Huggingface. (A hedged checkpoint-loading sketch follows the table.) |
| Experiment Setup | Yes | Given that α is an adjustable parameter that determines the attention error-bound tightness, we evaluate α = 0.2, 0.4, 0.6, and 1.0. We configure the Longformer with a window size of 256 and apply global attention to the CLS token. (A hedged Longformer configuration sketch follows the table.) |
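
The Research Type row describes replacing BERT's multi-head attention with MCA before evaluating on GLUE. The sketch below shows only the model-surgery pattern such a swap implies, assuming the Hugging Face `transformers` API; `MonteCarloSelfAttention` is a hypothetical placeholder, not the authors' released CUDA implementation from https://github.com/eis-lab/monte-carlo-attention.

```python
# Hedged sketch: swap every BertSelfAttention module for an approximate-attention
# drop-in. The wrapper below merely delegates to exact attention; a real MCA
# module would subsample value rows according to the alpha error-bound parameter.
import torch.nn as nn
from transformers import BertForSequenceClassification


class MonteCarloSelfAttention(nn.Module):  # hypothetical name, not from the paper's repo
    def __init__(self, original_self_attention, alpha=0.4):
        super().__init__()
        self.inner = original_self_attention
        self.alpha = alpha  # error-bound tightness knob evaluated in the paper

    def forward(self, *args, **kwargs):
        # Placeholder: call exact attention so the sketch stays runnable.
        return self.inner(*args, **kwargs)


model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
for layer in model.bert.encoder.layer:
    layer.attention.self = MonteCarloSelfAttention(layer.attention.self, alpha=0.4)
```

The released code instead implements the approximation as a C++/CUDA PyTorch extension; the module-replacement loop above is just one plausible way to wire such a kernel into a pre-trained checkpoint for GLUE fine-tuning.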
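
The Software Dependencies row notes that pre-trained weights for BERT, DistilBERT, and Longformer are downloaded via the Huggingface library without pinned versions. A minimal loading sketch follows; the hub checkpoint identifiers are the standard public ones and are assumed, not confirmed by the paper.

```python
# Hedged sketch: load the three pre-trained backbones named in the paper via
# the Hugging Face `transformers` Auto classes. Checkpoint names are assumptions.
from transformers import AutoModel, AutoTokenizer

checkpoints = {
    "bert": "bert-base-uncased",
    "distilbert": "distilbert-base-uncased",
    "longformer": "allenai/longformer-base-4096",
}

models = {}
for name, ckpt in checkpoints.items():
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModel.from_pretrained(ckpt)
    models[name] = (tokenizer, model)
    # Report model type and parameter count as a quick sanity check.
    print(name, model.config.model_type, sum(p.numel() for p in model.parameters()))
```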
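
The Experiment Setup row specifies a Longformer window size of 256 with global attention on the CLS token. The sketch below reproduces only that configuration step with the Hugging Face Longformer API, assuming the public `allenai/longformer-base-4096` checkpoint and an illustrative input sentence.

```python
# Hedged sketch: Longformer with a 256-token local attention window and global
# attention restricted to the CLS token (position 0), as described in the paper.
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained(
    "allenai/longformer-base-4096",
    attention_window=256,  # paper: window size of 256
)

inputs = tokenizer("An example document for long-input classification.", return_tensors="pt")

# 1 marks tokens that attend globally; only the CLS token gets global attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)
```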