Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Mamba Modulation: On the Length Generalization of Mamba Models

Authors: Peng Lu, Jerry Huang, Qiuhao Zeng, Xinyu Wang, Boxing Chen, Philippe Langlais, Yufei Cui

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through experiments on standard long-context extension settings, such as long-context language modeling and passkey retrieval, we demonstrate empirically how scaling A is more effective compared to scaling δ, in the case of both Mamba and Mamba2 models. Results on a series of long-context generalization tasks show such an intuition holds empirically on Mamba models, highlighting the potential benefits of using A for length generalization.
Researcher Affiliation	Collaboration	Peng Lu (Université de Montréal); Jerry Huang (Université de Montréal & Mila, Quebec AI Institute); Qiuhao Zeng (Western University); Xinyu Wang (McGill University); Boxing Chen (Noah's Ark Lab); Philippe Langlais (Université de Montréal); Yufei Cui (Noah's Ark Lab)
Pseudocode	Yes	Algorithm 1: MambaExtend methodology. 1: Input: model M, calibration set C, and function CF. 2: Output: scaling factors S = [s_1, …, s_L] ∈ ℝ_+^{d_s × L}. 3: for i = 1, …, L do 4: s_i ∼ U(0, 1) 5: end for 6: S ← CF(S, C, M) 7: return S
Open Source Code	Yes	Our code is available at https://github.com/gnepul-ace/mamba_modulation.
Open Datasets	Yes	We evaluate language modeling perplexity on the ProofPile dataset [40], following Peng et al. [96], across a varying number of context lengths. Figure 3 shows these perplexity results on a number of validation datasets, namely ProofPile [40], PG19 [103] and GovReport [59]. Evaluation is conducted on a set of fixed lengths and depths to evaluate for both generalization ability as well as potential biases to relative location within the sequence. The exact setup follows from Ben-Kish et al. [9]; in particular, the task consists of a 5-digit code embedded at a random sequence depth within samples from the WikiText-103 dataset [89]. LongBench [6] is a popular benchmark for testing the long-context abilities of LLMs, serving as a more suitable real-world benchmark on which we can explore how the scaling of A as opposed to Δ_t can influence performance.
Dataset Splits	No	To calibrate, 20 samples of the corresponding context length are used. For example, for a length of 16K, 20 samples of this length are used for the calibration of the set of s_i. Unlike the language modeling perplexity task, however, we train the model on a training set. This training set contains samples of length 4096 corresponding to the task, where the objective is standard instruction-tuning [32].
Hardware Specification	Yes	All experiments were conducted on a single machine with 2 NVIDIA RTX 4080 16GB GPUs.
Software Dependencies	Yes	Experiments were run in an environment using CUDA version 12.6 and PyTorch 2.6.0.
Experiment Setup	Yes	In our explicit implementation for calibrating scaling factors for A, we use the same hyperparameters as Azizi et al. [3]. Algorithm 2: Calibration via back-propagation. 1: Input: frozen model M, calibration set C, initial scaling factors S, learning rate η, perturbation magnitude c, iterations K. Algorithm 3: Calibration via zeroth-order optimization. 1: Input: frozen model M, calibration set C, perturbation magnitude c, iterations K. We train a single scaling factor s_i ∈ ℝ_+ for every layer i in the model. For an L-layer model, this means L individual scaling factors are used. To calibrate, 20 samples of the corresponding context length are used.
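
The control flow of Algorithm 1 quoted in the Pseudocode row (draw one scaling factor per layer from U(0, 1), then hand the factors to a calibration function CF) can be illustrated with a minimal sketch. This is not the authors' implementation: the frozen model M is replaced by a hypothetical per-sample loss callable `loss_fn`, and the zeroth-order calibrator uses an SPSA-style two-point estimate that is assumed here, since the Experiment Setup row lists only the inputs of Algorithm 3 (perturbation magnitude c, iterations K). The names `mamba_extend`, `zeroth_order_calibrate`, `toy_loss`, and the `step_size` parameter are likewise assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence

Scales = List[float]
# Hypothetical stand-in for "frozen model M evaluated on one calibration sample".
LossFn = Callable[[Scales, Sequence[float]], float]


def mean_loss(scales: Scales, calibration_set: Sequence[Sequence[float]],
              loss_fn: LossFn) -> float:
    """Average the per-sample loss over the calibration set C."""
    return sum(loss_fn(scales, x) for x in calibration_set) / len(calibration_set)


def zeroth_order_calibrate(scales: Scales,
                           calibration_set: Sequence[Sequence[float]],
                           loss_fn: LossFn,
                           c: float = 0.01,
                           iterations: int = 200,
                           step_size: float = 0.1,  # assumption: not listed in the source
                           seed: int = 0) -> Scales:
    """Assumed SPSA-style calibrator: two loss evaluations per step, no gradients."""
    rng = random.Random(seed)
    s = list(scales)
    for _ in range(iterations):
        # Random +/-1 perturbation direction, one entry per layer.
        delta = [rng.choice((-1.0, 1.0)) for _ in s]
        plus = [si + c * di for si, di in zip(s, delta)]
        minus = [si - c * di for si, di in zip(s, delta)]
        # Central-difference estimate of the slope along delta.
        slope = (mean_loss(plus, calibration_set, loss_fn)
                 - mean_loss(minus, calibration_set, loss_fn)) / (2.0 * c)
        # Descent step, clamped so every factor stays strictly positive.
        s = [max(si - step_size * slope * di, 1e-6) for si, di in zip(s, delta)]
    return s


def mamba_extend(num_layers: int,
                 calibration_set: Sequence[Sequence[float]],
                 loss_fn: LossFn,
                 calibrate: Callable[..., Scales] = zeroth_order_calibrate,
                 seed: int = 0) -> Scales:
    """Algorithm 1 control flow: s_i ~ U(0, 1) for each layer, then S <- CF(S, C, M)."""
    rng = random.Random(seed)
    scales = [rng.uniform(0.0, 1.0) for _ in range(num_layers)]
    return calibrate(scales, calibration_set, loss_fn)


def toy_loss(scales: Scales, sample: Sequence[float]) -> float:
    """Toy quadratic loss; each 'sample' is simply a target vector of factors."""
    return sum((s - t) ** 2 for s, t in zip(scales, sample))


calibration_set = [[0.5, 1.0, 1.5, 2.0], [0.6, 1.1, 1.4, 2.1], [0.4, 0.9, 1.6, 1.9]]
calibrated = mamba_extend(num_layers=4, calibration_set=calibration_set, loss_fn=toy_loss)
```

In a real setting `loss_fn` would run the frozen model on a long-context calibration sample with the per-layer factors applied; the toy quadratic is used only so the sketch runs without a model.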