Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Fixed-Point RNNs: Interpolating from Diagonal to Dense

Authors: Sajad Movahedi, Felix Sarnthein, Nicola Muca Cirone, Antonio Orvieto

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we investigate parameterizations of a large class of dense linear RNNs as fixed-points of parallelizable diagonal linear RNNs. The resulting models can naturally trade expressivity for efficiency at a fixed number of parameters and achieve state-of-the-art results on the state-tracking benchmarks A5 and S5, while matching performance on copying and other tasks.
Researcher Affiliation Academia (1) ELLIS Institute Tuebingen, (2) Max Planck Institute for Intelligent Systems, (3) Department of Mathematics, Imperial College London
Pseudocode No The paper includes mathematical equations and descriptions of algorithms but does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code Yes The code is available at github.com/dr-faustus/fp-rnn.
Open Datasets Yes The task of tracking state in the alternating group on five elements (A5) is one of the tasks introduced in (Merrill et al., 2024) to show that linear RNNs and SSMs cannot solve state-tracking problems. A5 is the simplest subset of S5, the word problem involving tracking the permutation of five elements. We use the copy task (Jelassi et al., 2024) in order to assess the memory capabilities of FP-Mamba. For the language modeling task, we use the implementation provided by Ajroldi (2024). We use a train subsample of the FineWeb dataset (Penedo et al., 2024) with 2B tokens, and a validation subsample with 200K tokens. We also evaluate FP-Mamba on the remaining unsolved task of the Chomsky Hierarchy of language problems introduced by Beck et al. (2024). Specifically, we focus on the mod arithmetic task with brackets. In order to investigate the state-tracking ability of the fixed-point framework in a natural language setting, we perform experiments on the catbAbI dataset (Schlag et al., 2021b). catbAbI (concatenated bAbI) is a reprocessing of the bAbI QA benchmark (Weston et al., 2015).
Dataset Splits Yes State tracking. We train all models for 5 epochs, with a batch size of 512, 3 different random seeds, learning rate set to 0.0001, weight decay set to 0.01, gradient clipping 1.0, and the AdamW optimizer (Loshchilov & Hutter, 2017). For the train data, we sample 16M datapoints from all the possible permutations for a sequence length of 16, and split the data with a ratio of 4 to 1 for train and validation samples. For the test data, we sample 500k sequences of length 50. We train the model for sequence length 16 on the train sample, and evaluate for sequence lengths 2 through 50 on the test sample. Copying. We train all models for 10000 iterations, batch size 128, 3 different random seeds, learning rate 0.00001, weight decay 0.1, gradient clipping 1.0, the AdamW optimizer, and with linear learning rate decay after a 300 iterations warmup. The data is sampled randomly at the start of the training/evaluation. We use a vocab size of 29, a context length of 256, and train the model for copy sequence length in the range 5 to 50, and evaluate for the range 5 to 100. Mod arithmetic. Our models are trained for 100000 iterations, batch size 256, learning rate 0.001, weight decay 0.1, and no gradient clipping. The learning rate is decayed using a cosine scheduling by a factor of 0.001 after 10000 iterations of warmup. The data is randomly sampled at the start of training/evaluation. We use a vocab size of 12, with context length 256, and train data sequence length in the range 3 to 40, and the test/evaluation data in the range 40 to 256.
Hardware Specification Yes Training Time on A5. In order to compare the proposed model to the baselines in terms of computation time, we train all of the baselines and our proposed model using the same hardware (A100-80GB GPUs) on the A5 task. Language Modeling. ...training on 4 A100-80GB GPUs with 4 accumulation steps, which is the batch size used in the 2.5B setting in (Gu & Dao, 2024).
Software Dependencies No The paper mentions using PyTorch in the context of the Ajroldi (2024) implementation for language modeling, but does not provide specific version numbers for PyTorch or other key software dependencies used across all experiments.
Experiment Setup Yes State tracking. We train all models for 5 epochs, with a batch size of 512, 3 different random seeds, learning rate set to 0.0001, weight decay set to 0.01, gradient clipping 1.0, and the AdamW optimizer (Loshchilov & Hutter, 2017). Copying. We train all models for 10000 iterations, batch size 128, 3 different random seeds, learning rate 0.00001, weight decay 0.1, gradient clipping 1.0, the AdamW optimizer, and with linear learning rate decay after a 300 iterations warmup. Mod arithmetic. Our models are trained for 100000 iterations, batch size 256, learning rate 0.001, weight decay 0.1, and no gradient clipping. The learning rate is decayed using a cosine scheduling by a factor of 0.001 after 10000 iterations of warmup. Language Modeling. We use a batch size of 16 × 4 × 4 = 256, training on 4 A100-80GB GPUs with 4 accumulation steps, which is the batch size used in the 2.5B setting in (Gu & Dao, 2024). The learning rate is optimized for the Mamba model (0.004) and we train all models with this learning rate, with cosine warmup with 0.1 steps. We use the AdamW optimizer with weight decay set to 0.1 and β1, β2 set to 0.9, 0.95.
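
The split and batch-size figures quoted in the table can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper's code: the function names are hypothetical, and two readings are assumptions. First, the language-modeling batch size of 256 is read as 16 sequences per device times 4 GPUs times 4 accumulation steps. Second, the steps-per-epoch figure assumes that each state-tracking epoch covers the train portion of the 4-to-1 split exactly once.

```python
from typing import Tuple


def split_counts(total: int, train_parts: int, val_parts: int) -> Tuple[int, int]:
    """Split `total` samples by the integer ratio train_parts:val_parts."""
    train = total * train_parts // (train_parts + val_parts)
    return train, total - train


def effective_batch_size(per_device: int, num_devices: int, accum_steps: int) -> int:
    """Sequences contributing to one optimizer step (assumed reading of the quoted product)."""
    return per_device * num_devices * accum_steps


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Optimizer steps needed to see every sample once (ceiling division)."""
    return -(-num_samples // batch_size)


# State tracking: 16M sampled datapoints, split 4 to 1 into train and validation.
train_n, val_n = split_counts(16_000_000, 4, 1)

# Language modeling: assumed to mean per-device batch x GPUs x accumulation steps.
lm_batch = effective_batch_size(per_device=16, num_devices=4, accum_steps=4)

# State tracking at batch size 512, assuming one pass over the train split per epoch.
st_steps = steps_per_epoch(train_n, 512)

print(train_n, val_n, lm_batch, st_steps)
```

Under these assumptions the 4-to-1 split gives 12.8M train and 3.2M validation datapoints, and the language-modeling product comes out to the quoted 256.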