Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification

Authors: Zhenglin Lai, Mengyao Liao, Bingzhe Wu, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, Jianqiang Li

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Applying SAFEX to mainstream MoE-based LLMs (including Mixtral-8x7B-Instruct-v0.1, Qwen1.5-MoE-A2.7B-Chat, deepseek-moe-16b-chat, and recently released Qwen3-30B-A3B), we empirically demonstrate the prevalence of positional vulnerabilities and identify safety-critical experts whose perturbation at the single-expert level significantly compromises overall model safety.
Researcher Affiliation	Collaboration	1 School of Artificial Intelligence, Shenzhen University; 2 ByteDance Inc.
Pseudocode	No	The paper describes the SAFEX framework with workflow diagrams (Figure 1, Figure 2) and algorithmic steps in text (e.g., Section 2.2 for Stability-based Expert Selection), but it does not present structured pseudocode or algorithm blocks.
Open Source Code	Yes	These results establish positional vulnerability as a distinct MoE-specific safety challenge and provide a practical, compute-efficient pathway for expert-level safety interventions within routed architectures (https://github.com/Bearisbug/SAFEx).
Open Datasets	Yes	To ensure comprehensive coverage, we uniformly sample harmful prompts from multiple predefined harmful categories (e.g., fraud, health consultation, illegal activities) from existing different benchmark datasets [11–15]. The detailed distribution of harmful content categories in D_Regular and their corresponding data sources are illustrated in the Appendix Figure 6. We constructed D_Benign by selecting the same number of samples in openai-moderation-api-evaluation [11] and wildguardtest [15].
Dataset Splits	Yes	We use the jailbreak dataset D_Jailbreak with a 7:3 train-test split, ensuring a balanced distribution across categories.
Hardware Specification	Yes	In this study, we used four NVIDIA H20 GPUs (96 GB memory each) and 512 GB of storage for models and data. Each model run was expected to utilize two H20 GPUs; due to several failed experiments, the actual compute consumption exceeded the amount reported in the paper.
Software Dependencies	No	The paper does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) needed to replicate the experiment.
Experiment Setup	Yes	Hyper-parameters for the linear probes. Each probe is an L2-regularized logistic regression classifier trained with regularization strength C = 1.0 using the lbfgs solver.
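
The Dataset Splits row reports a 7:3 train-test split that is balanced across categories, but the quoted excerpt does not say how the split was produced. The following is a minimal sketch of one common way to get such a split (shuffling within each category and cutting each group at 70%), not the authors' procedure. The function name stratified_split, the fixed seed, and the toy prompts and category labels are all hypothetical and used only for illustration.

```python
import random
from collections import defaultdict


def stratified_split(samples, train_ratio=0.7, seed=0):
    """Split (text, category) pairs so each category keeps the same train/test ratio.

    Assumption: the paper only states a 7:3 category-balanced split; the
    per-category shuffle and the seed used here are illustrative choices.
    """
    # Group the samples by their category label.
    by_category = defaultdict(list)
    for sample in samples:
        by_category[sample[1]].append(sample)

    rng = random.Random(seed)  # local RNG so the split is repeatable
    train, test = [], []
    for category in sorted(by_category):
        group = by_category[category]
        rng.shuffle(group)
        cut = round(len(group) * train_ratio)  # size of the training share
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Hypothetical toy data: three categories with ten prompts each.
samples = [
    (f"prompt-{category}-{i}", category)
    for category in ("fraud", "health", "illegal")
    for i in range(10)
]
train, test = stratified_split(samples)
print(len(train), len(test))
```

Splitting within each category, rather than over the pooled data, keeps the category proportions identical in both partitions, which is one way to read "a balanced distribution across categories".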