Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Routing Mamba: Scaling State Space Models with Mixture-of-Experts Projection

Authors: Zheng Zhan, Liliang Ren, Shuohang Wang, Liyuan Liu, Yang Liu, Yeyun Gong, Yanzhi Wang, Yelong Shen

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results further show RoM effectively scales hybrid language models, yielding a 23% FLOPs saving compared to dense Mamba scaling for similar performance. We conduct experiments across models of varying scales, encompassing different parameter sizes and architectures. Figure 3 presents the PPL results on the validation dataset, comparing RoM against a standard Mamba model across different training sequence lengths (4K, 8K, and 16K).
Researcher Affiliation	Collaboration	1 Microsoft, 2 Northeastern University
Pseudocode	No	The paper describes methodologies using mathematical formulations and prose, but does not present structured pseudocode or algorithm blocks.
Open Source Code	Yes	We release our training codebase at https://github.com/zhanzheng8585/Routing-Mamba.
Open Datasets	Yes	Following the settings outlined in [39], we report perplexity results on the SlimPajama [43] dataset, observing minimal fluctuation across evaluations. We evaluate task performance on several common sense reasoning datasets, including LAMBADA [35], HellaSwag [48], PIQA [2], ARC-Easy [4], ARC-Challenge [4], and WinoGrande [40].
Dataset Splits	No	By default, the models in this section are trained on SlimPajama using 20B tokens and a sequence length of 4K, unless stated otherwise. Figure 3: Validation Perplexity (PPL) on the SlimPajama validation set (with 20B tokens pretrained) with different MoE strategies.
Hardware Specification	Yes	Training speed is measured using 8 A100 GPUs. 20B training tokens on 8 A100 GPUs.
Software Dependencies	No	We choose PyTorch Fully Sharded Data Parallel (FSDP) with CPU offloading as a scalable framework for all the model training. For MoE computation without expert parallelism, we find the Megablocks [14] package to be helpful. Our model is optimized using the AdamW optimizer.
Experiment Setup	Yes	Our model is optimized using the AdamW optimizer with beta1 = 0.9 and beta2 = 0.95, with gradient clipping set to 1.0 and weight decay to 0.1. A cosine learning rate schedule is employed, with a maximum learning rate of 4e-4 for models and a warmup ratio of 0.01. We train our models with a global batch size of 2 million tokens, unless stated otherwise. By default, the models in this section are trained on SlimPajama using 20B tokens and a sequence length of 4K, unless stated otherwise.
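
The numbers quoted in the Experiment Setup row are enough to work out the approximate length of the default training run and the shape of its learning rate schedule. The sketch below is illustrative only and is not taken from the authors' codebase. It assumes that "2 million tokens" means exactly 2,000,000, that warmup is linear, and that the cosine schedule decays to a minimum learning rate of 0; none of these details are stated in the quoted text. The names `total_steps`, `lr_at`, and `MIN_LR` are hypothetical.

```python
import math

# Values quoted in the Experiment Setup row.
TOTAL_TOKENS = 20_000_000_000     # 20B training tokens
GLOBAL_BATCH_TOKENS = 2_000_000   # "2 million tokens" (assumed to be exactly 2e6)
MAX_LR = 4e-4                     # maximum learning rate
WARMUP_RATIO = 0.01               # fraction of steps spent in warmup

# Assumption, not stated in the row: the schedule decays to zero.
MIN_LR = 0.0


def total_steps(total_tokens=TOTAL_TOKENS, batch_tokens=GLOBAL_BATCH_TOKENS):
    """Number of optimizer steps needed to consume the token budget."""
    return total_tokens // batch_tokens


def lr_at(step, n_steps, max_lr=MAX_LR, warmup_ratio=WARMUP_RATIO, min_lr=MIN_LR):
    """Learning rate at a 0-indexed step: linear warmup (assumed), then cosine decay."""
    warmup_steps = round(warmup_ratio * n_steps)
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, n_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    n = total_steps()
    print("optimizer steps:", n)
    print("warmup steps:", round(WARMUP_RATIO * n))
    for s in (0, 99, 100, n // 2, n - 1):
        print(f"step {s}: lr = {lr_at(s, n):.3e}")
```

Under these assumptions the default run is about 10,000 optimizer steps, of which roughly the first 100 are warmup. A different batch-size convention (for example 2^21 tokens) or a nonzero minimum learning rate would shift these figures slightly.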