Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

TF-MAS: Training-free Mamba2 Architecture Search

Authors: Yi Fan, Yu-Bin Yang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results show that our method outperforms all existing training-free NAS approaches in terms of both ranking correlation and the performance of search results for Mamba2 architecture.
Researcher Affiliation	Academia	Yi Fan State Key Laboratory for Novel Software Technology Nanjing University Nanjing 210023, China EMAIL Yu-Bin Yang State Key Laboratory for Novel Software Technology Nanjing University Nanjing 210023, China EMAIL
Pseudocode	Yes	We present the pseudocode of our proxy computation process in Section D of the technical appendix. Algorithm 1: Proxy of TF-MAS
Open Source Code	Yes	Our codes are available at https://github.com/fanyi-plus/tf-nas.
Open Datasets	Yes	To our knowledge, current NASBenches encompass only CNN [65, 17, 57, 16] and Transformer [62], with none specific to Mamba. Therefore, we will sample architectures from both SSMamba2 and VWSSMamba2 to construct NASBenches pertaining to Mamba. We evaluate accuracy (where higher is better) on the datasets LAMBADA [46], HellaSwag [66], PIQA [7], Arc-E [8], Arc-C [8], WinoGrande [53], and OpenBookQA [45], while for LAMBADA, we also assess perplexity (where lower is better). For each sampled network, we perform 10 epochs of fine-tuning using the Pile dataset.
Dataset Splits	No	The paper uses several public datasets (LAMBADA, HellaSwag, PIQA, Arc-E, Arc-C, WinoGrande, OpenBookQA, and the Pile) for evaluation and fine-tuning. However, it does not provide training/validation/test split percentages, sample counts, or citations to predefined splits for its experimental runs. For instance, for the Pile dataset, it only mentions
Hardware Specification	Yes	Our search is conducted on 4 NVIDIA Tesla V100 GPUs, with search times on SSMamba2 and VWSSMamba2 being 0.7 day and 0.6 day, respectively.
Software Dependencies	No	The paper does not explicitly list specific software dependencies with version numbers.
Experiment Setup	Yes	The design of the search space generally follows the design principles outlined in [10]. Specifically, we have set the following 4 AHs: Depth (D): the number of Mamba2 blocks; Width (W): the dimension of each token; State dimension (N): the dimensionality of the state space; Number of heads (H): the number of heads in each Mamba2 block. All 4 AHs are positive integers. We set the dimension of each head to 64, which is consistent with the original Mamba2. During the search for opt Mamba2 and opt VWMamba2, our search method employs an evolutionary approach with a population size of 50, iterating for 300 generations. For each sampled network, we perform 10 epochs of fine-tuning using the Pile dataset. Under this condition, the feasible values for each AH are as follows: D: 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 (13 values); W: 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768 (13 values); N: 64, 72, 80, 88, 96, 104, 112, 120, 128 (9 values); H: 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 (13 values).
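
The feasible values quoted in the Experiment Setup row can be written down as a small search-space table. The following is a minimal sketch, not the authors' implementation: the names SEARCH_SPACE, space_size, and sample_architecture are hypothetical, and it assumes the four AHs can be chosen independently, which the quoted text does not confirm. It only counts the combinations and draws an initial random population of 50; the evolutionary operators and the TF-MAS proxy itself are not shown.

```python
import random
from math import prod

# Feasible values per architecture hyperparameter (AH), copied from the
# Experiment Setup row. Treating them as independent is an assumption.
SEARCH_SPACE = {
    "D": list(range(12, 25)),        # depth: 12, 13, ..., 24 (13 values)
    "W": list(range(384, 769, 32)),  # width: 384, 416, ..., 768 (13 values)
    "N": list(range(64, 129, 8)),    # state dimension: 64, 72, ..., 128 (9 values)
    "H": list(range(12, 25)),        # number of heads: 12, 13, ..., 24 (13 values)
}


def space_size(space):
    """Number of distinct configurations if every AH varies independently."""
    return prod(len(values) for values in space.values())


def sample_architecture(space, rng):
    """Draw one configuration uniformly at random (hypothetical helper)."""
    return {name: rng.choice(values) for name, values in space.items()}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Initial population of 50, matching the population size quoted above.
population = [sample_architecture(SEARCH_SPACE, rng) for _ in range(50)]

print(space_size(SEARCH_SPACE))  # 13 * 13 * 9 * 13 = 19773
```

Under the independence assumption the space holds 19,773 configurations, so a population of 50 evolved over 300 generations evaluates at most 15,000 candidates, fewer than the full grid.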