The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry

Authors: Michael Zhang, Kush Bhatia, Hermann Kumbong, Christopher Ré

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show Hedgehog recovers over 99% of standard Transformer quality in train-from-scratch and finetuned-conversion settings, outperforming prior linear attentions by up to 6 perplexity points on WikiText-103 with causal GPTs, and up to 8.7 GLUE score points on finetuned bidirectional BERTs.
Researcher Affiliation | Academia | Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher Ré, Department of Computer Science, Stanford University. {mzhang,kushb,chrismre}@cs.stanford.edu, kumboh@stanford.edu
Pseudocode | Yes | We include further implementation details and pseudocode in Appendix A. PyTorch-like code is given below. (A hedged linear-attention sketch follows the table.)
Open Source Code | No | The paper includes pseudocode snippets and discusses implementation using Hugging Face Transformers, but does not provide a concrete link to an open-source repository for the full methodology.
Open Datasets | Yes | training Transformer models with linear attention with the goal of matching standard Transformer performance, e.g., as tested on benchmarks such as Long Range Arena (LRA) classification (Tay et al., 2021) and WikiText-103 language modeling (Merity et al., 2017). ... Corpus of Linguistic Acceptability (CoLA) task (Warstadt et al., 2019) ... SAMSum summarization (Gliwa et al., 2019).
Dataset Splits | Yes | We generate 10,000 training samples following the patterns described in Table 12, and evaluate on 2,000 newly-generated test samples (again using the same associative recall structure, but with different token associations). ... explicitly stopping training if validation loss stops decreasing after 10 epochs.
Hardware Specification | Yes | For all experiments, we use non-quantized model weights in bfloat16, and conduct all training runs and evaluations on a single A6000 GPU. (A bfloat16 loading sketch follows the table.)
Software Dependencies | No | The paper mentions using PyTorch and Hugging Face Transformers, but does not specify their version numbers or other software dependencies with versions.
Experiment Setup | Yes | For WikiText-103, we train a 125M-parameter GPT-2-style Transformer with learning rate 6e-4, weight decay 0.01, and the AdamW optimizer. ... We train with batch size 8, learning rate 1e-5, zero weight decay, AdamW optimizer, and up to 10 epochs with early stopping. (A hedged training-configuration sketch follows the table.)
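
The Pseudocode row notes that the paper's PyTorch-like code appears in Appendix A, which is not reproduced in this report. The sketch below is our own minimal illustration of the kind of construction the paper describes: a trainable per-head feature map (a linear projection whose outputs and their negations pass through exp) used inside standard linear attention. The names (HedgehogFeatureMap, linear_attention, feature_dim) are placeholders, and the snippet omits the causal masking and attention-distillation training the paper uses; it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class HedgehogFeatureMap(nn.Module):
    """Trainable feature map in the spirit of Hedgehog's softmax mimicry:
    a linear projection followed by exp() on the outputs and their negations."""
    def __init__(self, head_dim: int, feature_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(head_dim, feature_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, heads, seq_len, head_dim)
        z = self.proj(x)
        # Concatenating exp(z) and exp(-z) lets the map produce spiky,
        # softmax-like attention weights.
        return torch.cat([torch.exp(z), torch.exp(-z)], dim=-1)

def linear_attention(q, k, v, feature_map, eps=1e-6):
    """Non-causal O(n) attention: phi(q) (phi(k)^T v) / (phi(q) sum_j phi(k_j))."""
    q_f, k_f = feature_map(q), feature_map(k)            # (B, H, N, F)
    kv = torch.einsum("bhnf,bhnd->bhfd", k_f, v)          # (B, H, F, D)
    num = torch.einsum("bhnf,bhfd->bhnd", q_f, kv)        # (B, H, N, D)
    denom = torch.einsum("bhnf,bhf->bhn", q_f, k_f.sum(dim=2)) + eps
    return num / denom.unsqueeze(-1)

# Tiny usage example with random tensors.
B, H, N, D = 2, 4, 128, 64
q, k, v = (torch.randn(B, H, N, D) for _ in range(3))
fmap = HedgehogFeatureMap(head_dim=D)
out = linear_attention(q, k, v, fmap)
print(out.shape)  # torch.Size([2, 4, 128, 64])
```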
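
The Hardware Specification row states that all runs used non-quantized bfloat16 weights on a single A6000 GPU. A minimal sketch of loading a Hugging Face checkpoint under those assumptions is shown below; the "gpt2" checkpoint name is an arbitrary example, not necessarily a model used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load non-quantized weights in bfloat16 and place them on a single GPU,
# matching the hardware setup quoted in the report.
model_name = "gpt2"  # placeholder; the paper evaluates GPT-2-style and BERT models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model = model.to("cuda:0")  # single A6000 in the paper's setup

inputs = tokenizer("Linear attention scales to long sequences", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.dtype)  # torch.bfloat16
```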
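
The Experiment Setup and Dataset Splits rows quote the reported hyperparameters (AdamW with learning rate 6e-4 and weight decay 0.01 for the 125M GPT-2-style model on WikiText-103; batch size 8, learning rate 1e-5, zero weight decay, and up to 10 epochs with early stopping for finetuning). The sketch below shows one way such a configuration might be wired up in PyTorch; the model, data loaders, loss function, and the simplified stop-on-first-non-improvement criterion are our placeholders, since the paper's exact training script is not available.

```python
import torch
from torch.optim import AdamW

# Hyperparameters quoted in the report; everything else here is illustrative.
PRETRAIN_LR, PRETRAIN_WD = 6e-4, 0.01   # 125M GPT-2-style model on WikiText-103
FINETUNE_LR, FINETUNE_WD = 1e-5, 0.0    # finetuning setting, batch size 8
MAX_EPOCHS = 10                          # "up to 10 epochs with early stopping"

def make_optimizer(model: torch.nn.Module, finetune: bool = True) -> AdamW:
    """Build AdamW with the quoted learning rate and weight decay."""
    lr, wd = (FINETUNE_LR, FINETUNE_WD) if finetune else (PRETRAIN_LR, PRETRAIN_WD)
    return AdamW(model.parameters(), lr=lr, weight_decay=wd)

def train(model, train_loader, val_loader, loss_fn):
    """Minimal loop that stops when validation loss stops decreasing."""
    opt = make_optimizer(model)
    best_val = float("inf")
    for epoch in range(MAX_EPOCHS):
        model.train()
        for inputs, targets in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            opt.step()
        # Validation pass; early-stop on the first epoch without improvement.
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x), y).item() for x, y in val_loader) / len(val_loader)
        if val >= best_val:
            break
        best_val = val
```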