Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Routing Mamba: Scaling State Space Models with Mixture-of-Experts Projection
Authors: Zheng Zhan, Liliang Ren, Shuohang Wang, Liyuan Liu, Yang Liu, Yeyun Gong, Yanzhi Wang, Yelong Shen
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results further show RoM effectively scales hybrid language models, yielding a 23% FLOPs saving compared to dense Mamba scaling for similar performance. We conduct experiments across models of varying scales, encompassing different parameter sizes and architectures. Figure 3 presents the PPL results on the validation dataset, comparing RoM against a standard Mamba model across different training sequence lengths (4K, 8K, and 16K). |
| Researcher Affiliation | Collaboration | 1Microsoft 2Northeastern University |
| Pseudocode | No | The paper describes methodologies using mathematical formulations and prose, but does not present structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release our training codebase at https://github.com/zhanzheng8585/Routing-Mamba. |
| Open Datasets | Yes | Following the settings outlined in [39], we report perplexity results on the SlimPajama [43] dataset, observing minimal fluctuation across evaluations. We evaluate task performance on several common sense reasoning datasets, including LAMBADA [35], HellaSwag [48], PIQA [2], ARC-Easy [4], ARC-Challenge [4], and WinoGrande [40]. |
| Dataset Splits | No | By default, the models in this section are trained on SlimPajama using 20B tokens and a sequence length of 4K, unless stated otherwise. Figure 3: Validation Perplexity (PPL) on the SlimPajama validation set (with 20B tokens pretrained) with different MoE strategies. |
| Hardware Specification | Yes | Training speed is measured using 8 A100 GPUs. 20B training tokens on 8 A100 GPUs. |
| Software Dependencies | No | We choose PyTorch Fully Sharded Data Parallel (FSDP) with CPU offloading as a scalable framework for all the model training. For MoE computation without expert parallelism, we find the Megablocks [14] package to be helpful. Our model is optimized using the AdamW optimizer. |
| Experiment Setup | Yes | Our model is optimized using the AdamW optimizer with beta1 = 0.9 and beta2 = 0.95, with gradient clipping set to 1.0 and weight decay to 0.1. A cosine learning rate schedule is employed, with a maximum learning rate of 4e-4 for models and a warmup ratio of 0.01. We train our models with a global batch size of 2 million tokens, unless stated otherwise. By default, the models in this section are trained on SlimPajama using 20B tokens and a sequence length of 4K, unless stated otherwise. |
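
The schedule quoted in the Experiment Setup row can be sketched in plain Python to see what the numbers imply. This is an illustrative sketch, not the authors' training code. The function names `total_steps` and `lr_at` are hypothetical, and several details are assumptions the excerpt does not state: that "20B tokens" and "2 million tokens" mean exactly 20,000,000,000 and 2,000,000, that warmup is linear from zero, and that the cosine decay ends at a learning rate of zero.

```python
import math

# Values quoted in the Experiment Setup row.
MAX_LR = 4e-4
WARMUP_RATIO = 0.01

# Assumption: exact token counts for "20B tokens" and "2 million tokens".
TOTAL_TOKENS = 20_000_000_000
TOKENS_PER_BATCH = 2_000_000


def total_steps(total_tokens=TOTAL_TOKENS, tokens_per_batch=TOKENS_PER_BATCH):
    """Number of optimizer steps needed to consume the token budget."""
    return total_tokens // tokens_per_batch


def lr_at(step, num_steps, max_lr=MAX_LR, warmup_ratio=WARMUP_RATIO, min_lr=0.0):
    """Learning rate at a given step.

    Assumptions: linear warmup from 0 to max_lr, then cosine decay
    from max_lr down to min_lr (0 by default) at the final step.
    """
    warmup_steps = max(1, int(round(num_steps * warmup_ratio)))
    if step < warmup_steps:
        # Linear warmup phase.
        return max_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, num_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    n = total_steps()
    for s in (0, 50, 100, n // 2, n):
        print(s, lr_at(s, n))
```

Under these assumptions the run takes 10,000 optimizer steps, of which the first 100 are warmup. The remaining quoted settings (beta1 = 0.9, beta2 = 0.95, weight decay 0.1, gradient clipping at 1.0) would be passed to the optimizer and clipping utility of whichever framework is used; the excerpt names PyTorch but gives no version.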