Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions

Authors: Stefano Massaroli, Michael Poli, Dan Fu, Hermann Kumbong, Rom Parnichkun, David Romero, Aman Timalsina, Quinn McIntyre, Beidi Chen, Atri Rudra, Ce Zhang, Christopher Ré, Stefano Ermon, Yoshua Bengio

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Pretraining: We pretrain a suite of Multi Hyena language models on The Pile [11], investigating how perplexity scales with the total number of training tokens (5, 10, and 15 billion), as well as a larger run of 300 billion tokens. Multi Hyena outperforms Transformers and Hyena. Distillation analysis: We investigate the relation between optimal distillation orders, the Hankel spectrum, and errors on the logits of distilled models. Post-distillation downstream evaluation: We evaluate the downstream impact of distilling long convolutional language models, reporting HELM [41] and LM-Eval-Harness [42] results. Benchmarking: We benchmark latency, throughput, and memory along the axes of batch size, sequence length, and number of generated tokens, covering base models, distilled models, and equivalent Transformers.
Researcher Affiliation | Collaboration | (1) Mila and Université de Montréal; (2) Stanford University; (3) The University of Tokyo; (4) Purdue University; (5) Vrije Universiteit Amsterdam; (6) Carnegie Mellon University and Meta AI (FAIR); (7) University at Buffalo, SUNY; (8) University of Chicago and Together Computer.
Pseudocode | Yes | Algorithm 1 (Hyena). Require: input sequence u ∈ ℝ^{L×D} from the previous layer, long convolution filter T_h, number of heads M. (A minimal sketch of this multi-head long convolution appears after the table.)
Open Source Code | No | The paper does not explicitly state that source code for the methodology is released or provide a link to a code repository.
Open Datasets | Yes | We pretrain a suite of Multi Hyena language models on The Pile [11].
Dataset Splits | No | The paper mentions training on 'The Pile' and evaluating on 'LM-Eval Harness' and 'HELM' tasks, following 'the setup of [2]', but does not explicitly specify train/validation/test split percentages or sample counts for its experiments.
Hardware Specification | Yes | All experiments are carried out on an NVIDIA A100 (80 GB) in float16 precision.
Software Dependencies | No | The paper mentions the use of the 'AdamW [58] optimizer' but does not provide specific version numbers for any software dependencies, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | All Multi Hyena models are set to 8 heads and otherwise use the same hyperparameters as Hyena models of equivalent size. We set the weight decay of Hyena filter parameters to 0 and lower the frequency of sine activations in the implicit MLP to 4. We follow the setup of [2] and first train models for 5, 10, and 15 billion tokens, adjusting the learning rate scheduler accordingly. Then, we train for 300 billion tokens. ... To optimize the parameters of the modal form, we use gradient-based optimization and minimize the ℓ2 discrepancy between filters in the time domain. In particular, we use the AdamW [58] optimizer with learning rate 3 × 10⁻⁴ and a cosine annealing decay schedule down to 10⁻⁶ after 30 thousand iterations.
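
To make the Pseudocode row concrete, below is a minimal sketch of an FFT-based multi-head long convolution in the spirit of Algorithm 1 (Hyena). It is an illustrative reimplementation, not the authors' code: the function name, the even split of the D channels across M heads, and the zero-padding choice are assumptions.

```python
# Minimal sketch of a multi-head long convolution (hypothetical reimplementation,
# not the paper's code). Shapes follow the Require line of Algorithm 1:
# input u of shape (L, D), long convolution filter of shape (L, D), M heads.
import torch


def multihead_long_conv(u: torch.Tensor, h: torch.Tensor, M: int) -> torch.Tensor:
    """u: (L, D) input sequence, h: (L, D) long convolution filters, M: number of heads."""
    L, D = u.shape
    assert D % M == 0, "channel dimension must split evenly across heads"
    # Causal convolution via FFT: zero-pad to 2L to avoid circular wrap-around.
    n = 2 * L
    U = torch.fft.rfft(u, n=n, dim=0)           # (n//2 + 1, D)
    H = torch.fft.rfft(h, n=n, dim=0)           # (n//2 + 1, D)
    y = torch.fft.irfft(U * H, n=n, dim=0)[:L]  # (L, D), keep the causal part
    # Group channels into M heads, e.g. for subsequent head-wise gating.
    return y.reshape(L, M, D // M)


# Usage: a length-1024 sequence with 64 channels and 8 heads.
u = torch.randn(1024, 64)
h = torch.randn(1024, 64)
y = multihead_long_conv(u, h, M=8)
print(y.shape)  # torch.Size([1024, 8, 8])
```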
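The modal-form fitting procedure quoted in the Experiment Setup row (AdamW at learning rate 3 × 10⁻⁴, cosine annealing down to 10⁻⁶ over 30 thousand iterations, ℓ2 discrepancy between filters in the time domain) can be sketched as follows. The real-valued pole/residue parameterization and the sigmoid stability constraint are illustrative assumptions, not the paper's exact modal form.

```python
# Hedged sketch of distilling a long-convolution filter into a modal form by
# minimizing the time-domain l2 discrepancy, using the optimizer settings
# quoted in the Experiment Setup row. The parameterization below is assumed.
import torch

L, ORDER = 1024, 16                       # filter length, distillation order
h_target = torch.randn(L)                 # stands in for a pretrained Hyena filter

# Modal form: h[t] = sum_n R_n * p_n^t, with poles kept inside the unit disk.
R = torch.randn(ORDER, requires_grad=True)        # residues
p_raw = torch.randn(ORDER, requires_grad=True)    # unconstrained pole logits

opt = torch.optim.AdamW([R, p_raw], lr=3e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=30_000, eta_min=1e-6)
t = torch.arange(L).float()

for step in range(30_000):
    poles = torch.sigmoid(p_raw)                  # |p_n| < 1 => stable, decaying modes
    h_modal = (R[None, :] * poles[None, :] ** t[:, None]).sum(dim=-1)
    loss = torch.mean((h_modal - h_target) ** 2)  # l2 discrepancy in the time domain
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
```

Constraining the poles to lie inside the unit disk (here via a sigmoid) keeps each mode a decaying exponential, which is what lets the distilled filter be evaluated as a constant-memory recurrence at inference time.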