State-Free Inference of State-Space Models: The *Transfer Function* Approach

Authors: Rom Parnichkun, Stefano Massaroli, Alessandro Moro, Jimmy T.H. Smith, Ramin Hasani, Mathias Lechner, Qi An, Christopher Ré, Hajime Asama, Stefano Ermon, Taiji Suzuki, Michael Poli, Atsushi Yamashita

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results across multiple sequence lengths and state sizes illustrate, on average, a 35% training speed improvement over S4 layers parametrized in the time domain on the Long Range Arena benchmark, while delivering state-of-the-art downstream performance over other attention-free approaches. In order to validate the proposed parametrization, we conducted experiments across a range of tasks, models, and, importantly, state sizes, including Long Range Arena (LRA), language modeling, and synthetic tasks.
Researcher Affiliation | Collaboration | The University of Tokyo, Liquid AI, RIKEN, Stanford University, and the Massachusetts Institute of Technology.
Pseudocode | Yes | Algorithm 1: RTF Kernel Generation (a minimal sketch of this kernel-generation step is given after the table).
Open Source Code | Yes | Our code is available at https://github.com/ruke1ire/RTF.
Open Datasets | Yes | We conducted these experiments on RTF along with S4, S4D, and SpaceTime (Zhang et al., 2023) as presented in Table 1. The Long Range Arena (LRA) benchmark comprises: ListOps, an extended dataset introduced by Nangia & Bowman (2018); IMDB, a sentiment dataset from Maas et al. (2011); Retrieval, derived from the ACL Anthology network corpus introduced by Radev et al. (2009); Image, which uses the CIFAR-10 dataset introduced by Krizhevsky (2009); and Pathfinder, derived from the Pathfinder challenge presented by Linsley et al. (2018). The synthetic tasks are the Copying task, akin to Arjovsky et al. (2016), and the Delay task, which was also used to ablate HiPPO SSM initialization schemes (Gu et al., 2023). Language models were trained on The Pile (Gao et al., 2021) and on the well-established WikiText-103 dataset.
Dataset Splits | Yes | The dataset contains 96,000 training, 2,000 validation, and 2,000 test sequences. The dataset consists of 25,000 training and 25,000 test examples. The dataset comprises 147,086 training pairs, 18,090 validation pairs, and 17,437 test pairs. The dataset comprises 45,000 training examples, 5,000 validation examples, and 10,000 test examples. The dataset includes 160,000 training examples, 20,000 validation examples, and 20,000 test examples. Each model was given 10,000 training samples for 50 epochs, and was tested with 1,000 unseen samples.
Hardware Specification | Yes | Experiments were conducted using JAX (Bradbury et al., 2018) on a single A100 80GB GPU for the memory profiling experiments, and on a single H100 80GB GPU for the latency profiling experiments.
Software Dependencies | No | The paper mentions software such as JAX, PyKeOps (for S4/S4D), the AdamW optimizer, the GELU activation, and Xavier initialization, but it does not provide specific version numbers for these software components, which are required for reproducibility.
Experiment Setup | Yes | Table 8: hyperparameters used for the classification datasets. Table 9: layer hyperparameters used for the classification datasets. Table 10: Copying task hyperparameters. Table 11: Delay task hyperparameters. Table 13: WikiText-103 hyperparameters. We also made slight modifications to the Hyena operator's output linear projection by inserting an additional low-rank linear layer and a GELU activation (Hendrycks & Gimpel, 2023) before the final output linear projection. Therefore, we instead adopt the Xavier initialization (Glorot & Bengio, 2010) over the rational function coefficients and apply the Montel constraint via an ℓ1 penalization, as shown in Section B.2.
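The pseudocode row above refers to the paper's Algorithm 1, RTF Kernel Generation. As a reading aid, here is a minimal sketch of the underlying idea in JAX: a rational transfer function H(z) = b(z)/a(z) is expanded into a length-L convolution kernel by evaluating the numerator and denominator polynomials at the L-th roots of unity with the FFT and transforming the quotient back. The function name `rtf_kernel`, the coefficient layout (monic denominator, stored directly), and the neglect of truncation/aliasing details are our assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of frequency-domain RTF kernel generation (assumptions noted above).
import jax.numpy as jnp


def rtf_kernel(b: jnp.ndarray, a: jnp.ndarray, seq_len: int) -> jnp.ndarray:
    """Expand H(z) = (b_0 + b_1 z^-1 + ...) / (1 + a_1 z^-1 + ...) into a kernel.

    `b` holds numerator coefficients and `a` holds denominator coefficients
    a_1..a_d (the leading denominator coefficient is fixed to 1); the kernel is
    truncated to `seq_len` taps.
    """
    # Zero-pad both coefficient vectors to the target kernel length.
    num = jnp.zeros(seq_len).at[: b.shape[0]].set(b)
    den = jnp.zeros(seq_len).at[0].set(1.0).at[1 : a.shape[0] + 1].set(a)
    # Evaluate both polynomials at the seq_len-th roots of unity via the FFT,
    # divide pointwise, and transform the quotient back to the time domain.
    freq_response = jnp.fft.rfft(num) / jnp.fft.rfft(den)
    return jnp.fft.irfft(freq_response, n=seq_len)


# Example: a degree-4 transfer function expanded into a 1024-tap kernel, which
# could then be applied to an input sequence by FFT convolution.
kernel = rtf_kernel(jnp.array([0.5, 0.1, 0.0, -0.2, 0.05]),
                    jnp.array([-0.3, 0.2, 0.0, 0.1]),
                    seq_len=1024)
```

The point of the sketch is only that the kernel is produced directly from transfer-function coefficients, without materializing a state-space realization; for the exact algorithm and its handling of truncation, the paper and the released code are the authority.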
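The experiment-setup row also quotes the use of Xavier initialization over the rational-function coefficients together with a Montel constraint imposed through an ℓ1 penalization (the paper's Section B.2). Below is a hedged sketch of one way such a penalty could look; the hinge form, the unit margin, and the weighting of the term in the loss are illustrative assumptions rather than the paper's exact formulation.

```python
import jax.numpy as jnp


def montel_l1_penalty(a: jnp.ndarray) -> jnp.ndarray:
    """Soft stability penalty on denominator coefficients a_1..a_d (sketch).

    A classical sufficient condition for the monic polynomial
    z^d + a_1 z^(d-1) + ... + a_d to have all roots strictly inside the unit
    disk is sum_i |a_i| < 1, so this penalizes any excess of the l1 norm over 1.
    The hinge form is an illustrative choice, not the paper's specification.
    """
    return jnp.maximum(jnp.sum(jnp.abs(a)) - 1.0, 0.0)


# Assumed usage: add the penalty to the task loss with some small weight, e.g.
#   loss = task_loss + 1e-2 * montel_l1_penalty(a)
```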