Hungry Hungry Hippos: Towards Language Modeling with State Space Models
Authors: Daniel Y. Fu, Tri Dao, Khaled Kamal Saab, Armin W. Thomas, Atri Rudra, Christopher Ré
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | H3 matches attention on the synthetic languages and comes within 0.4 PPL of Transformers on OpenWebText. Furthermore, a hybrid 125M-parameter H3-attention model that retains two attention layers surprisingly outperforms Transformers on OpenWebText by 1.0 PPL. ... FlashConv yields 2× speedup on the long-range arena benchmark and allows hybrid language models to generate text 2.4× faster than Transformers. ... achieving lower perplexity than Transformers and outperforming Transformers in zero- and few-shot learning on a majority of tasks in the SuperGLUE benchmark. |
| Researcher Affiliation | Academia | Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré. Stanford University; University at Buffalo, SUNY. {danfu,tridao}@cs.stanford.edu, {ksaab,athms}@stanford.edu, atri@buffalo.edu, chrismre@cs.stanford.edu |
| Pseudocode | Yes | Algorithm 1 (H3 Layer) and Algorithm 2 (State Passing Algorithm) |
| Open Source Code | Yes | Code for H3 is available at https://github.com/HazyResearch/H3. |
| Open Datasets | Yes | OpenWebText (Gokaslan et al., 2019), the Pile (Gao et al., 2020), WikiText-103 (Merity et al., 2016), and the SuperGLUE benchmark. |
| Dataset Splits | Yes | We randomly select 0.5% of the dataset as the validation set, with the rest being used as training set. |
| Hardware Specification | Yes | All models were trained on either a single 16x A100-40GB node or a cluster of 8x A100-80GB nodes. |
| Software Dependencies | No | We run all implementations with mixed-precision training (PyTorch AMP). ... We use the AdamW optimizer |
| Experiment Setup | Yes | We use an effective batch size of 512, and use gradient accumulation... We use the AdamW optimizer, with learning rate 6e-4 for GPT-2 small and 1.5e-4 for GPT-2 medium, and weight decay of 0.1. All models are trained with the same hyperparameters for 100K steps. We train models with sequence length 1024. |
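
The Pseudocode row points to Algorithm 1 (H3 Layer). The sketch below is a rough orientation only, not the authors' implementation: it shows the overall structure of a single-head H3 operator (Q/K/V projections, a shift SSM applied to K, multiplicative interaction with V, a diagonal SSM, and gating by Q), but substitutes a causal depthwise convolution for the learned shift-SSM kernel and a naive per-channel decay recurrence for the diagonal SSM, and omits the multi-head outer-product structure and the FFT-based FlashConv evaluation. See the released repository for the real layer.

```python
import torch
import torch.nn as nn

class SimplifiedH3(nn.Module):
    """Heavily simplified, single-head sketch of the H3 operator.

    Stand-ins (assumptions, not the paper's implementation):
    - shift SSM  -> short causal depthwise convolution
    - diag SSM   -> per-channel linear recurrence with a learned decay
    """
    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Causal depthwise conv over time, standing in for the shift SSM.
        self.shift_conv = nn.Conv1d(d_model, d_model, kernel_size,
                                    padding=kernel_size - 1, groups=d_model)
        # Per-channel decay in (0, 1), standing in for the diagonal SSM.
        self.log_decay = nn.Parameter(torch.randn(d_model))

    def forward(self, x):                       # x: (batch, seqlen, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # "Shift SSM" on K; truncate the conv output to preserve causality.
        k = self.shift_conv(k.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        kv = k * v                              # multiplicative interaction
        # "Diagonal SSM": h_t = a * h_{t-1} + kv_t, per channel.
        a = torch.sigmoid(self.log_decay)
        h = torch.zeros_like(kv[:, 0])
        outs = []
        for t in range(kv.shape[1]):
            h = a * h + kv[:, t]
            outs.append(h)
        s = torch.stack(outs, dim=1)
        return self.out_proj(q * s)             # gate with Q, then project


layer = SimplifiedH3(d_model=64)
y = layer(torch.randn(2, 16, 64))               # (batch, seqlen, d_model) -> same shape
```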
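The Dataset Splits row quotes a simple procedure: randomly hold out 0.5% of the dataset as a validation set. A minimal illustration of such a split is below; the function name, the document-level granularity, and the fixed seed are assumptions for the sketch, not details taken from the paper.

```python
import random

def split_validation(documents, val_fraction=0.005, seed=0):
    """Randomly hold out `val_fraction` of the documents as a validation set."""
    rng = random.Random(seed)
    indices = list(range(len(documents)))
    rng.shuffle(indices)
    n_val = max(1, int(len(documents) * val_fraction))
    val = [documents[i] for i in indices[:n_val]]
    train = [documents[i] for i in indices[n_val:]]
    return train, val


docs = [f"document {i}" for i in range(10_000)]
train_docs, val_docs = split_validation(docs)
print(len(train_docs), len(val_docs))   # 9950 50
```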
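The Software Dependencies and Experiment Setup rows together describe mixed-precision training with PyTorch AMP, the AdamW optimizer (learning rate 6e-4 for GPT-2 small, weight decay 0.1), an effective batch size of 512 reached via gradient accumulation, 100K training steps, and sequence length 1024. The sketch below strings those quoted settings into a runnable training loop; the tiny stand-in model, synthetic token data, micro-batch size, and shortened loop length are placeholders, and no learning-rate schedule is shown because the quoted text does not specify one.

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# Tiny stand-in model and synthetic data so the loop runs end to end;
# the paper's experiments train GPT-2-style models for 100K steps.
vocab_size, seq_len, micro_batch = 1000, 1024, 8
model = nn.Sequential(nn.Embedding(vocab_size, 256), nn.Linear(256, vocab_size)).cuda()

optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)  # GPT-2 small setting
scaler = GradScaler()                     # PyTorch AMP loss scaling
accum_steps = 512 // micro_batch          # gradient accumulation to an effective batch of 512

optimizer.zero_grad()
for step in range(accum_steps * 2):       # a couple of optimizer updates, for illustration only
    tokens = torch.randint(vocab_size, (micro_batch, seq_len), device="cuda")
    with autocast():                      # mixed-precision forward pass
        logits = model(tokens)
        loss = nn.functional.cross_entropy(     # next-token prediction loss
            logits[:, :-1].reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
        ) / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```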