Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Hungry Hungry Hippos: Towards Language Modeling with State Space Models

Authors: Daniel Y Fu, Tri Dao, Khaled Kamal Saab, Armin W Thomas, Atri Rudra, Christopher Re

ICLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | H3 matches attention on the synthetic languages and comes within 0.4 PPL of Transformers on OpenWebText. Furthermore, a hybrid 125M-parameter H3-attention model that retains two attention layers surprisingly outperforms Transformers on OpenWebText by 1.0 PPL. ... FLASHCONV yields 2× speedup on the long-range arena benchmark and allows hybrid language models to generate text 2.4× faster than Transformers. ... achieving lower perplexity than Transformers and outperforming Transformers in zero- and few-shot learning on a majority of tasks in the SuperGLUE benchmark.
Researcher Affiliation | Academia | Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré; Stanford University; University at Buffalo, SUNY; EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1 (H3 Layer) and Algorithm 2 (State Passing Algorithm)
Open Source Code | Yes | Code for H3 is available at https://github.com/HazyResearch/H3.
Open Datasets | Yes | OpenWebText (Gokaslan et al., 2019), the Pile (Gao et al., 2020), WikiText-103 (Merity et al., 2016) and SuperGLUE benchmark.
Dataset Splits | Yes | We randomly select 0.5% of the dataset as the validation set, with the rest being used as training set.
Hardware Specification | Yes | All models were trained on either a single 16x A100-40GB node or a cluster of 8x A100-80GB nodes.
Software Dependencies | No | We run all implementations with mixed-precision training (PyTorch AMP). ... We use the AdamW optimizer
Experiment Setup | Yes | We use an effective batch size of 512, and use gradient accumulation... We use the AdamW optimizer, with learning rate 6e-4 for GPT-2 small and 1.5e-4 for GPT-2 medium, and weight decay of 0.1. All models are trained with the same hyperparameters for 100K steps. We train models with sequence length 1024.
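
The Dataset Splits and Experiment Setup entries above can be illustrated with a minimal sketch. It is not the authors' code: the names `TrainConfig`, `accumulation_steps`, and `split_indices` are hypothetical, as are the micro-batch size, device count, example count, and random seed. Only the effective batch size (512), learning rate (6e-4 for GPT-2 small), weight decay (0.1), step count (100K), sequence length (1024), and validation fraction (0.5%) come from the quoted text.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values quoted in the Experiment Setup entry (GPT-2 small setting).
    effective_batch_size: int = 512
    learning_rate: float = 6e-4
    weight_decay: float = 0.1
    max_steps: int = 100_000
    seq_len: int = 1024


def accumulation_steps(effective_batch_size, micro_batch_size, num_devices):
    """Gradient-accumulation steps needed to reach the effective batch size."""
    per_step = micro_batch_size * num_devices
    if effective_batch_size % per_step != 0:
        raise ValueError(
            "effective batch size must be a multiple of "
            "micro_batch_size * num_devices"
        )
    return effective_batch_size // per_step


def split_indices(num_examples, val_fraction=0.005, seed=0):
    """Randomly hold out a fraction of example indices for validation."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    num_val = max(1, round(num_examples * val_fraction))
    # The first num_val shuffled indices form the validation set.
    return indices[num_val:], indices[:num_val]


if __name__ == "__main__":
    cfg = TrainConfig()
    # Hypothetical layout: 8 devices, 16 sequences per device per forward pass.
    steps = accumulation_steps(
        cfg.effective_batch_size, micro_batch_size=16, num_devices=8
    )
    train_idx, val_idx = split_indices(10_000)
    print(steps, len(train_idx), len(val_idx))
```

Under these assumed numbers, each optimizer step accumulates gradients over 4 forward passes, and an effective batch of 512 sequences at length 1024 corresponds to 524,288 tokens per step. Fixing the seed keeps the held-out 0.5% identical across runs, which the quoted text does not specify but which is a common choice for reproducibility.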