SMYRF - Efficient Attention using Asymmetric Clustering

Authors: Giannis Daras, Nikita Kitaev, Augustus Odena, Alexandros G. Dimakis

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply our method to pre-trained state-of-the-art Natural Language Processing and Computer Vision models and we report significant memory and speed benefits. Notably, SMYRF-BERT outperforms (slightly) BERT on GLUE, while using 50% less memory. We also show that SMYRF can be used interchangeably with dense attention before and after training. Finally, we use SMYRF to train GANs with attention in high resolutions. Using a single TPU, we were able to scale attention to 128×128 = 16k and 256×256 = 65k tokens on BigGAN on CelebA-HQ. (A back-of-envelope memory comparison for these token counts follows the table.)
Researcher Affiliation | Collaboration | Giannis Daras, Computer Science Department, The University of Texas at Austin, giannisdaras@utexas.edu; Augustus Odena, Google Research, augustusodena@google.com; Nikita Kitaev, Google Research, kitaev@cs.berkeley.edu; Alexandros G. Dimakis, ECE Department, The University of Texas at Austin, dimakis@austin.utexas.edu
Pseudocode | No | The paper describes the steps of the SMYRF algorithm in text (e.g., "Our algorithm consists of the following steps:") but does not provide a formal pseudocode block or algorithm figure. (A hedged sketch of the general clustered-attention idea appears after the table.)
Open Source Code | Yes | We open-source our code and pre-trained models to encourage more related research: https://github.com/giannisdaras/smyrf.
Open Datasets | Yes | We use a pre-trained BigGAN, which is a state-of-the-art model in Image Generation for ImageNet [37]. ... We train SMYRF-BERT (base) on the GLUE [25, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51] benchmark, using sequence length 128.
Dataset Splits | No | The paper refers to "GLUE (dev)" and names the datasets used for training and evaluation, but it does not give percentages or sample counts for train/validation/test splits, nor does it cite specific predefined splits beyond the dataset names.
Hardware Specification | Yes | Using a single TPU, we were able to scale attention to 128×128 = 16k and 256×256 = 65k tokens on BigGAN on CelebA-HQ. ... Finally, we successfully train a BigGAN with attention at resolution 256×256 on a single v3-8 TPU.
Software Dependencies | No | The paper mentions "PyTorch [36]" for the pre-trained BigGAN model but does not specify version numbers for PyTorch or any other software dependencies crucial for replication.
Experiment Setup | Yes | We train SMYRF-BERT (base) on the GLUE [25, ...] benchmark, using sequence length 128. ... We also run experiments with SMYRF-BERT (large) on a subset of the GLUE tasks. ... We also experiment on the IMDB [52] dataset, using sequence length 512 tokens. ... For SMYRF models, we train and evaluate with SMYRF. ... We choose CelebA-HQ because: (i) images are in resolution higher than 128×128, (ii) our budget is limited and CelebA-HQ requires far fewer training steps than ImageNet [37]. With SMYRF, we move attention from 64×64 resolution to 128×128 and train with 50% less memory than dense attention. ... We conduct our experiments on CelebA-HQ-128 for 120K steps.
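
The abstract quoted in the Research Type row claims attention over 16k and 65k tokens on a single TPU. The snippet below is only a back-of-envelope estimate (not a measurement from the paper) of why dense attention is hard at those lengths: the per-head attention score matrix grows quadratically with the token count, while attention restricted to fixed-size clusters grows linearly. The cluster size of 256 is an illustrative assumption, not a value taken from the paper.

```python
# Back-of-envelope memory for the attention score matrix alone (float32),
# ignoring activations, weights, and optimizer state.

BYTES_PER_FLOAT = 4

def dense_attention_scores_bytes(n_tokens: int) -> int:
    """Dense attention materializes an n x n score matrix per head."""
    return n_tokens * n_tokens * BYTES_PER_FLOAT

def clustered_attention_scores_bytes(n_tokens: int, cluster_size: int) -> int:
    """Clustered attention scores each token only against the ~cluster_size
    tokens in its own cluster (cluster_size=256 is an assumption here)."""
    return n_tokens * cluster_size * BYTES_PER_FLOAT

for n in (128 * 128, 256 * 256):  # the 16k and 65k token counts from the abstract
    dense = dense_attention_scores_bytes(n)
    clustered = clustered_attention_scores_bytes(n, cluster_size=256)
    print(f"{n:>6} tokens: dense ≈ {dense / 2**30:.1f} GiB, "
          f"clustered (C=256) ≈ {clustered / 2**30:.3f} GiB per head")
```

At 65k tokens the dense score matrix alone is on the order of 16 GiB per head, which is why the quadratic term has to be avoided to fit such resolutions on a single accelerator.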
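Since the Pseudocode row notes that no formal algorithm block is given, here is a minimal sketch of the general clustered-attention idea (hash queries and keys, form equal-size clusters, attend only within each cluster). It is not the authors' implementation: it uses a single random projection as the hash, omits SMYRF's asymmetric transformations and multi-round hashing, and the function name `clustered_attention` and all shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def clustered_attention(q, k, v, n_clusters):
    """Sketch: bucket queries/keys with a shared random projection, form
    equal-size clusters by sorting the hash scores, and run dense attention
    only inside each cluster.

    q, k, v: (batch, seq_len, dim); seq_len must be divisible by n_clusters.
    """
    b, n, d = q.shape
    c = n // n_clusters  # tokens per cluster

    # 1. Hash: project onto one random direction (a stand-in for SMYRF's
    #    asymmetric LSH, which transforms queries and keys differently).
    r = torch.randn(d, 1, device=q.device)
    q_idx = (q @ r).squeeze(-1).argsort(dim=-1)   # (b, n) query order
    k_idx = (k @ r).squeeze(-1).argsort(dim=-1)   # (b, n) key order

    # 2. Balanced clusters: consecutive tokens in sorted order share a cluster.
    def gather(x, idx):
        return x.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))

    q_s = gather(q, q_idx).view(b, n_clusters, c, d)
    k_s = gather(k, k_idx).view(b, n_clusters, c, d)
    v_s = gather(v, k_idx).view(b, n_clusters, c, d)

    # 3. Dense attention within each cluster only: O(n * c) instead of O(n^2).
    scores = q_s @ k_s.transpose(-1, -2) / d ** 0.5
    out_s = F.softmax(scores, dim=-1) @ v_s       # (b, n_clusters, c, d)

    # 4. Undo the query sort so outputs line up with the original token order.
    out = torch.empty_like(q)
    out.scatter_(1, q_idx.unsqueeze(-1).expand(-1, -1, d), out_s.view(b, n, d))
    return out

# Example: 16k tokens split into 256 clusters of 64 tokens each.
q = k = v = torch.randn(1, 16384, 64)
print(clustered_attention(q, k, v, n_clusters=256).shape)  # torch.Size([1, 16384, 64])
```

Keeping the clusters the same size, as this sketch does by sorting and splitting evenly, matches the paper's emphasis on balanced clustering: equal-size groups keep the per-cluster computation a dense batched matmul, which is what makes the approach practical on TPUs/GPUs.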