Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers

Authors: Sukjun Hwang, Aakash Sunil Lahoti, Ratish Puduppully, Tri Dao, Albert Gu

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We provide extensive experimental results that substantiate our claims. Our systematic ablation studies control architectural variables to highlight the impact of matrix parameterization. These careful experiments confirm that Sequence Alignment, a property we newly identified in certain matrix mixers, significantly enhances downstream performance. |
| Researcher Affiliation | Collaboration | ¹Machine Learning Department, Carnegie Mellon University; ²IT University of Copenhagen; ³Department of Computer Science, Princeton University; ⁴Cartesia AI. {sukjunh,alahoti}@cs.cmu.edu, rapu@itu.dk, tri@tridao.me, agu@cs.cmu.edu |
| Pseudocode | Yes | Figure 5: Pseudocode for Hydra. B, L, H, P denote batch size, sequence length, number of heads, and head dimension, respectively. The suffixes _f and _b denote forward and backward. (A sketch of this forward/backward combination follows the table.) |
| Open Source Code | Yes | We publicly release source code at https://github.com/goombalab/hydra. |
| Open Datasets | Yes | We pretrain our models on the masked language modeling objective using the Colossal Clean Crawled Corpus (C4) [36], then finetune and evaluate them on the GLUE benchmark [43]. |
| Dataset Splits | Yes | We pretrain our models on the masked language modeling objective using the Colossal Clean Crawled Corpus (C4) [36], then finetune and evaluate them on the GLUE benchmark [43]. |
| Hardware Specification | No | This research was made possible by the generous support of computational resources provided by Cartesia AI. |
| Software Dependencies | No | BERT trained with the latest Hugging Face recipe [46] |
| Experiment Setup | Yes | The specific hyperparameters for reproducing the results in Table 4 are reported in Table 6, and the settings used for obtaining the results of Hydra in Table 5 are listed in Table 9. |
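The Figure 5 caption references separate forward (_f) and backward (_b) passes combined into one bidirectional mixer. Below is a minimal sketch of how such a quasiseparable combination can be assembled, assuming a generic causal mixer as a stand-in for Hydra's actual SSM kernel; the names `shift`, `seq_mixer`, and `hydra_mix` are hypothetical, not from the released code, and the running-mean mixer is only a placeholder.

```python
# Hedged sketch of a bidirectional (quasiseparable) sequence mixer:
# output = shift(forward pass) + flip(shift(backward pass)) + diagonal term.
# Shapes follow the Figure 5 caption: B, L = batch size, sequence length;
# heads and head dimension (H, P) are flattened into one channel axis here.
import torch
import torch.nn.functional as F

def shift(x: torch.Tensor) -> torch.Tensor:
    """Shift right along the sequence axis by one step, zero-padding the front."""
    return F.pad(x, (0, 0, 1, 0))[:, :-1]

def seq_mixer(x: torch.Tensor) -> torch.Tensor:
    """Placeholder causal mixer (running mean); the real model uses an SSM."""
    steps = torch.arange(1, x.shape[1] + 1, device=x.device, dtype=x.dtype)
    return x.cumsum(dim=1) / steps.view(1, -1, 1)

def hydra_mix(x: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    """Combine a forward (_f) and a backward (_b) pass with a diagonal term."""
    y_f = shift(seq_mixer(x))                    # position i mixes inputs j < i
    y_b = shift(seq_mixer(x.flip(1))).flip(1)    # position i mixes inputs j > i
    return y_f + y_b + D * x                     # j == i handled by D alone

B, L, HP = 2, 16, 64                  # batch, length, flattened heads * head dim
x = torch.randn(B, L, HP)
D = torch.randn(HP)
print(hydra_mix(x, D).shape)          # torch.Size([2, 16, 64])
```

The shifts make the forward and backward contributions strictly lower- and upper-triangular as matrix mixers, so the diagonal is parameterized exactly once by D, consistent with the quasiseparable structure the paper generalizes.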