Hyena Hierarchy: Towards Larger Convolutional Language Models

Authors: Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models. We set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WIKITEXT103 and THE PILE), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K.
Researcher Affiliation | Academia | ¹Stanford University, ²Mila and Université de Montréal.
Pseudocode | Yes | Algorithm 1 (Projection), Algorithm 2 (Hyena Filter), and Algorithm 3 (Forward pass of Hyena); a hedged code sketch of these algorithms appears after this table.
Open Source Code | Yes | An implementation of Hyena can be found at this link.
Open Datasets | Yes | We evaluate the perplexity on WIKITEXT103 (Table A.2) and THE PILE (Table 4.2). On THE PILE, we train different models for 5, 10, 15 billion tokens (different runs), adjusting the learning rate scheduler. We also report results of additional training runs on other datasets. We train a Hyena 153M model on the standard PG-19 long-range corpus (Rae et al., 2019)... We experiment on sequential CIFAR, where pixels are flattened as a sequence... On ImageNet, we drop-in replace attention layers in the Vision Transformer (ViT) (Dosovitskiy et al., 2020) with the Hyena operator (without changes from its language counterpart) and match performance with ViT.
Dataset Splits | No | The paper uses standard datasets such as WIKITEXT103, THE PILE, and ImageNet-1k and mentions training and testing, and it refers to examples from the validation set for MultiRC, but it does not specify exact percentages or counts for the train/validation/test splits, nor does it cite a predefined split with explicit details. It states "standard datasets" without giving specific split details.
Hardware Specification | Yes | All models are trained on a single node of 8 A100 80GB GPUs. We train from scratch with no outside data on 8 Nvidia A100 GPUs. We use basic image augmentations, 0.1 dropout, 0.03 weight decay and train for 100 epochs using a Nvidia T4 GPU.
Software Dependencies | No | The paper mentions software such as PyTorch and FlashAttention, but it does not provide specific version numbers for any of the software components used in the experiments. For example, it mentions the "standard attention implementation in PyTorch" without specifying the PyTorch version.
Experiment Setup | Yes | Table 4 (hyperparameter settings for reasoning and in-context learning tasks), Table 6 (hyperparameter settings for THE PILE, 125M), and Table 11 (ViT and ViT-Hyena settings for ImageNet-1k).
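
As a companion to the Pseudocode row, below is a minimal PyTorch sketch of the Hyena recurrence described by Algorithms 1-3 (projections, implicit filter, forward pass). The names and hyperparameters here (HyenaOperator, HyenaFilter, fft_conv, the FFN widths, and the sinusoidal positional features) are illustrative assumptions, not the authors' released implementation; the official code is the one referenced in the Open Source Code row.

```python
# Minimal sketch of the Hyena recurrence, assuming a simplified implicit
# filter (small FFN over sinusoidal positional features with an exponential
# decay window). Not the paper's released code; shapes and widths are guesses.
import math
import torch
import torch.nn as nn


def fft_conv(u, h):
    """Causal convolution of u (batch, channels, length) with a long filter h
    (channels, length), evaluated in O(L log L) time via the FFT."""
    seq_len = u.shape[-1]
    fft_size = 2 * seq_len                      # zero-pad to avoid wrap-around
    u_f = torch.fft.rfft(u.float(), n=fft_size)
    h_f = torch.fft.rfft(h.float(), n=fft_size)
    return torch.fft.irfft(u_f * h_f, n=fft_size)[..., :seq_len].type_as(u)


class HyenaFilter(nn.Module):
    """Implicit long filter: an FFN over positional features, multiplied by a
    learned exponential-decay window (widths here are assumptions)."""

    def __init__(self, channels, emb_dim=8, hidden=64):
        super().__init__()
        self.emb_dim = emb_dim
        self.ffn = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, channels),
        )
        self.decay = nn.Parameter(torch.rand(channels))

    def forward(self, seq_len):
        device = self.decay.device
        t = torch.linspace(0, 1, seq_len, device=device)            # (L,)
        k = torch.arange(1, self.emb_dim // 2 + 1, device=device)   # frequencies
        feats = torch.cat([torch.sin(math.pi * t[:, None] * k),
                           torch.cos(math.pi * t[:, None] * k)], dim=-1)
        h = self.ffn(feats).transpose(0, 1)                         # (channels, L)
        window = torch.exp(-self.decay[:, None] * t[None, :])       # decay window
        return h * window


class HyenaOperator(nn.Module):
    """Order-N Hyena: (N + 1) projections with a short depthwise convolution,
    then N steps of elementwise gating interleaved with implicit long convs."""

    def __init__(self, d_model, order=2, kernel_size=3):
        super().__init__()
        width = (order + 1) * d_model
        self.d_model = d_model
        self.in_proj = nn.Linear(d_model, width)
        self.short_conv = nn.Conv1d(width, width, kernel_size,
                                    padding=kernel_size - 1, groups=width)
        self.filters = nn.ModuleList([HyenaFilter(d_model) for _ in range(order)])
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, u):                         # u: (batch, length, d_model)
        seq_len = u.shape[1]
        z = self.in_proj(u).transpose(1, 2)       # (batch, (N+1)*d, L)
        z = self.short_conv(z)[..., :seq_len]     # causal short convolution
        *gates, v = z.split(self.d_model, dim=1)  # gates x^1..x^N and value v

        # Hyena recurrence: z_{n+1} = x^n * (h^n convolved with z^n)
        for x_n, filt in zip(gates, self.filters):
            v = x_n * fft_conv(v, filt(seq_len))

        return self.out_proj(v.transpose(1, 2))   # (batch, length, d_model)


# Usage: an attention-free, drop-in token mixer over a length-1024 sequence.
layer = HyenaOperator(d_model=128, order=2)
y = layer(torch.randn(4, 1024, 128))
print(y.shape)  # torch.Size([4, 1024, 128])
```

The design choice mirrored here is the one the paper relies on: each long convolution is evaluated with the FFT rather than materialized attention, which is what gives the subquadratic scaling behind the long-sequence results quoted in the Research Type row.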