State-Free Inference of State-Space Models: The *Transfer Function* Approach

Authors: Rom Parnichkun, Stefano Massaroli, Alessandro Moro, Jimmy T.H. Smith, Ramin Hasani, Mathias Lechner, Qi An, Christopher Ré, Hajime Asama, Stefano Ermon, Taiji Suzuki, Michael Poli, Atsushi Yamashita

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results across multiple sequence lengths and state sizes illustrate, on average, a 35% training speed improvement over S4 layers parametrized in the time domain on the Long Range Arena benchmark, while delivering state-of-the-art downstream performance over other attention-free approaches. In order to validate the proposed parametrization, we conducted experiments across a range of tasks, models, and, importantly, state sizes, including Long Range Arena (LRA), language modeling, and synthetic tasks.
Researcher Affiliation | Collaboration | The University of Tokyo, Liquid AI, RIKEN, Stanford University, and the Massachusetts Institute of Technology.
Pseudocode | Yes | Algorithm 1: RTF Kernel Generation (a minimal sketch of this kernel-generation step is given after the table).
Open Source Code | Yes | Our code is available at https://github.com/ruke1ire/RTF.
Open Datasets | Yes | We conducted these experiments on RTF along with S4, S4D, and SpaceTime (Zhang et al., 2023) as presented in Table 1. The Long Range Arena (LRA) benchmark comprises: ListOps, an extended dataset introduced by Nangia & Bowman (2018); IMDB, a sentiment dataset from Maas et al. (2011); Retrieval, derived from the ACL Anthology network corpus introduced by Radev et al. (2009); Image, which uses the CIFAR-10 dataset introduced by Krizhevsky (2009); and Pathfinder, derived from the Pathfinder challenge presented by Linsley et al. (2018). The synthetic tasks are the Copying task, akin to Arjovsky et al. (2016), and the Delay task, which was also used to ablate HiPPO SSM initialization schemes (Gu et al., 2023). Language models were trained on The Pile (Gao et al., 2021) and on the well-established WikiText-103 dataset.
Dataset Splits | Yes | The dataset contains 96,000 training, 2,000 validation, and 2,000 test sequences. The dataset consists of 25,000 training and 25,000 test examples. The dataset comprises 147,086 training pairs, 18,090 validation pairs, and 17,437 test pairs. The dataset comprises 45,000 training examples, 5,000 validation examples, and 10,000 test examples. The dataset includes 160,000 training examples, 20,000 validation examples, and 20,000 test examples. Each model was given 10,000 training samples for 50 epochs, and was tested with 1,000 unseen samples.
Hardware Specification | Yes | Experiments were conducted using JAX (Bradbury et al., 2018) on a single A100 80GB GPU for the memory profiling experiments, and on a single H100 80GB GPU for the latency profiling experiments.
Software Dependencies | No | The paper mentions software such as JAX, PyKeOps (for S4/S4D), the AdamW optimizer, the GELU activation, and Xavier initialization, but it does not provide specific version numbers for these software components, which are required for reproducibility.
Experiment Setup | Yes | Table 8: hyperparameters used for the classification datasets. Table 9: layer hyperparameters used for the classification datasets. Table 10: Copying task hyperparameters. Table 11: Delay task hyperparameters. Table 13: WikiText-103 hyperparameters. We also made slight modifications to the Hyena operator's output linear projection by inserting an additional low-rank linear layer and a GELU activation (Hendrycks & Gimpel, 2023) before the final output linear projection. Therefore, we instead adopt the Xavier initialization (Glorot & Bengio, 2010) over the rational function coefficients and apply the Montel constraint via an ℓ1 penalization, as shown in Section B.2.
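The pseudocode row above refers to the paper's Algorithm 1, RTF Kernel Generation. As a reading aid, here is a minimal sketch of the underlying idea in JAX: a rational transfer function H(z) = b(z)/a(z) is expanded into a length-L convolution kernel by evaluating the numerator and denominator polynomials at the L-th roots of unity with the FFT and transforming the quotient back. The function name `rtf_kernel`, the coefficient layout (monic denominator, stored directly), and the neglect of truncation/aliasing details are our assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of frequency-domain RTF kernel generation (assumptions noted above).
import jax.numpy as jnp


def rtf_kernel(b: jnp.ndarray, a: jnp.ndarray, seq_len: int) -> jnp.ndarray:
    """Expand H(z) = (b_0 + b_1 z^-1 + ...) / (1 + a_1 z^-1 + ...) into a kernel.

    `b` holds numerator coefficients and `a` holds denominator coefficients
    a_1..a_d (the leading denominator coefficient is fixed to 1); the kernel is
    truncated to `seq_len` taps.
    """
    # Zero-pad both coefficient vectors to the target kernel length.
    num = jnp.zeros(seq_len).at[: b.shape[0]].set(b)
    den = jnp.zeros(seq_len).at[0].set(1.0).at[1 : a.shape[0] + 1].set(a)
    # Evaluate both polynomials at the seq_len-th roots of unity via the FFT,
    # divide pointwise, and transform the quotient back to the time domain.
    freq_response = jnp.fft.rfft(num) / jnp.fft.rfft(den)
    return jnp.fft.irfft(freq_response, n=seq_len)


# Example: a degree-4 transfer function expanded into a 1024-tap kernel, which
# could then be applied to an input sequence by FFT convolution.
kernel = rtf_kernel(jnp.array([0.5, 0.1, 0.0, -0.2, 0.05]),
                    jnp.array([-0.3, 0.2, 0.0, 0.1]),
                    seq_len=1024)
```

The point of the sketch is only that the kernel is produced directly from transfer-function coefficients, without materializing a state-space realization; for the exact algorithm and its handling of truncation, the paper and the released code are the authority.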
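The experiment-setup row also quotes the use of Xavier initialization over the rational-function coefficients together with a Montel constraint imposed through an ℓ1 penalization (the paper's Section B.2). Below is a hedged sketch of one way such a penalty could look; the hinge form, the unit margin, and the weighting of the term in the loss are illustrative assumptions rather than the paper's exact formulation.

```python
import jax.numpy as jnp


def montel_l1_penalty(a: jnp.ndarray) -> jnp.ndarray:
    """Soft stability penalty on denominator coefficients a_1..a_d (sketch).

    A classical sufficient condition for the monic polynomial
    z^d + a_1 z^(d-1) + ... + a_d to have all roots strictly inside the unit
    disk is sum_i |a_i| < 1, so this penalizes any excess of the l1 norm over 1.
    The hinge form is an illustrative choice, not the paper's specification.
    """
    return jnp.maximum(jnp.sum(jnp.abs(a)) - 1.0, 0.0)


# Assumed usage: add the penalty to the task loss with some small weight, e.g.
#   loss = task_loss + 1e-2 * montel_l1_penalty(a)
```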