Hungry Hungry Hippos: Towards Language Modeling with State Space Models

Authors: Daniel Y. Fu, Tri Dao, Khaled Kamal Saab, Armin W. Thomas, Atri Rudra, Christopher Ré

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | H3 matches attention on the synthetic languages and comes within 0.4 PPL of Transformers on OpenWebText. Furthermore, a hybrid 125M-parameter H3-attention model that retains two attention layers surprisingly outperforms Transformers on OpenWebText by 1.0 PPL. ... FlashConv yields 2× speedup on the long-range arena benchmark and allows hybrid language models to generate text 2.4× faster than Transformers. ... achieving lower perplexity than Transformers and outperforming Transformers in zero- and few-shot learning on a majority of tasks in the SuperGLUE benchmark.
Researcher Affiliation | Academia | Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré; Stanford University and University at Buffalo, SUNY; {danfu,tridao}@cs.stanford.edu, {ksaab,athms}@stanford.edu, atri@buffalo.edu, chrismre@cs.stanford.edu
Pseudocode | Yes | Algorithm 1 (H3 Layer) and Algorithm 2 (State Passing Algorithm); a hedged sketch of the layer structure follows the table.
Open Source Code | Yes | Code for H3 is available at https://github.com/HazyResearch/H3.
Open Datasets | Yes | OpenWebText (Gokaslan et al., 2019), the Pile (Gao et al., 2020), WikiText-103 (Merity et al., 2016), and the SuperGLUE benchmark.
Dataset Splits | Yes | We randomly select 0.5% of the dataset as the validation set, with the rest being used as the training set. (A schematic random split is sketched below.)
Hardware Specification | Yes | All models were trained on either a single 16x A100-40GB node or a cluster of 8x A100-80GB nodes.
Software Dependencies | No | We run all implementations with mixed-precision training (PyTorch AMP). ... We use the AdamW optimizer.
Experiment Setup | Yes | We use an effective batch size of 512, and use gradient accumulation... We use the AdamW optimizer, with learning rate 6e-4 for GPT-2 small and 1.5e-4 for GPT-2 medium, and weight decay of 0.1. All models are trained with the same hyperparameters for 100K steps. We train models with sequence length 1024. (A hedged training-setup sketch follows the table.)
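
The paper's Algorithm 1 describes the H3 layer as query/key/value projections combined through a shift SSM, a diagonal SSM, and two multiplicative interactions. The sketch below is a minimal, hedged PyTorch rendering of that structure only: the two SSMs are stood in for by causal depthwise convolutions (the paper's actual SSM kernels and FFT-based FlashConv are not reproduced here), and all module and parameter names are illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn


class H3LayerSketch(nn.Module):
    """Minimal sketch of the H3 layer structure (Algorithm 1).

    Assumption: the shift SSM and diagonal SSM are replaced by causal
    depthwise 1D convolutions as placeholders; the real layer computes
    SSM kernels and applies them via FFT convolution.
    """

    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Placeholders for the shift SSM (applied to K) and the diagonal
        # SSM (applied to the K*V interaction); both causal and depthwise.
        self.shift_ssm = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size - 1, groups=d_model)
        self.diag_ssm = nn.Conv1d(d_model, d_model, kernel_size,
                                  padding=kernel_size - 1, groups=d_model)

    def _causal(self, conv: nn.Conv1d, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); convolve over time and trim the
        # right-side padding so the operation stays causal.
        seq_len = x.shape[1]
        y = conv(x.transpose(1, 2))[..., :seq_len]
        return y.transpose(1, 2)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q_proj(u), self.k_proj(u), self.v_proj(u)
        k = self._causal(self.shift_ssm, k)      # shift SSM over K
        kv = self._causal(self.diag_ssm, k * v)  # diagonal SSM over K*V
        return self.out_proj(q * kv)             # gate with Q, project out
```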
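
The quoted 0.5% validation split can be reproduced schematically as below. The seed and the document-level granularity are assumptions for illustration; the paper does not specify the authors' preprocessing code.

```python
import random


def split_documents(documents, val_fraction=0.005, seed=0):
    """Randomly hold out a fraction of documents for validation.

    A minimal sketch of a 0.5% random validation split; the seed and the
    per-document granularity are assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    indices = list(range(len(documents)))
    rng.shuffle(indices)
    n_val = max(1, int(len(documents) * val_fraction))
    val = [documents[i] for i in indices[:n_val]]
    train = [documents[i] for i in indices[n_val:]]
    return train, val
```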
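
The reported optimization setup (AdamW with learning rate 6e-4 for the GPT-2 small configuration and weight decay 0.1, an effective batch size of 512 reached via gradient accumulation, 100K steps, sequence length 1024, mixed precision with PyTorch AMP) could look roughly like the sketch below. The micro-batch size, the data loader, and the loss-returning model interface are placeholders, not the authors' training script.

```python
import itertools
import torch


def train_sketch(model, loader, device="cuda",
                 total_steps=100_000, micro_batch=8, effective_batch=512):
    """Hedged sketch of the reported setup: AdamW (lr 6e-4, weight decay 0.1),
    gradient accumulation to an effective batch of 512, and PyTorch AMP.
    The micro-batch size and the model's loss interface are assumptions."""
    accum_steps = effective_batch // micro_batch
    optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)
    scaler = torch.cuda.amp.GradScaler()
    batches = itertools.cycle(loader)  # batches of 1024-token sequences assumed

    for _ in range(total_steps):
        optimizer.zero_grad(set_to_none=True)
        for _ in range(accum_steps):
            input_ids, labels = next(batches)
            input_ids, labels = input_ids.to(device), labels.to(device)
            with torch.cuda.amp.autocast():
                loss = model(input_ids, labels=labels)  # assumed to return the LM loss
            scaler.scale(loss / accum_steps).backward()
        scaler.step(optimizer)
        scaler.update()
```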