Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification
Authors: Zhenglin Lai, Mengyao Liao, Bingzhe Wu, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, Jianqiang Li
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applying SAFEx to mainstream MoE-based LLMs (including Mixtral-8x7B-Instruct-v0.1, Qwen1.5-MoE-A2.7B-Chat, deepseek-moe-16b-chat, and recently released Qwen3-30B-A3B), we empirically demonstrate the prevalence of positional vulnerabilities and identify safety-critical experts whose perturbation at the single-expert level significantly compromises overall model safety. |
| Researcher Affiliation | Collaboration | 1 School of Artificial Intelligence, Shenzhen University; 2 ByteDance Inc. |
| Pseudocode | No | The paper describes the SAFEx framework with workflow diagrams (Figure 1, Figure 2) and algorithmic steps in text (e.g., Section 2.2 for Stability-based Expert Selection), but it does not present structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | These results establish positional vulnerability as a distinct MoE-specific safety challenge and provide a practical, compute-efficient pathway for expert-level safety interventions within routed architectures (https://github.com/Bearisbug/SAFEx). |
| Open Datasets | Yes | To ensure comprehensive coverage, we uniformly sample harmful prompts from multiple predefined harmful categories (e.g., fraud, health consultation, illegal activities) from existing different benchmark datasets [11–15]. The detailed distribution of harmful content categories in D_Regular and their corresponding data sources are illustrated in the Appendix Figure 6. We constructed D_Benign by selecting the same number of samples in openai-moderation-api-evaluation [11] and wildguardtest [15]. |
| Dataset Splits | Yes | We use the jailbreak dataset D_Jailbreak with a 7:3 train-test split, ensuring a balanced distribution across categories. |
| Hardware Specification | Yes | In this study, we used four NVIDIA H20 GPUs (96 GB memory each) and 512 GB of storage for models and data. Each model run was expected to utilize two H20 GPUs; due to several failed experiments, the actual compute consumption exceeded the amount reported in the paper. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) needed to replicate the experiment. |
| Experiment Setup | Yes | Hyper-parameters for the linear probes. Each probe is an L2-regularized logistic regression classifier trained with regularization strength C = 1.0 using the lbfgs solver. |
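
The Dataset Splits and Experiment Setup rows can be illustrated with a minimal sketch. It assumes scikit-learn, which the paper does not name (the Software Dependencies row notes that no libraries or versions are listed), and it uses synthetic placeholder features rather than the paper's data. Stratifying on a category label is one possible reading of "balanced distribution across categories", and the two reported settings are combined here only for illustration; all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data (assumption): 200 feature vectors, a binary probe
# target, and 4 prompt categories with 50 samples each.
n_samples, n_features = 200, 16
y = np.tile([0, 1], n_samples // 2)                    # binary labels
categories = np.repeat(np.arange(4), n_samples // 4)   # category id per sample
X = rng.normal(size=(n_samples, n_features)) + 1.5 * y[:, None]

# 7:3 train-test split, stratified so every category keeps the same
# proportion in both partitions (an assumed reading of "balanced").
X_train, X_test, y_train, y_test, cat_train, cat_test = train_test_split(
    X, y, categories, test_size=0.3, stratify=categories, random_state=0
)

# L2-regularized logistic regression with C = 1.0 and the lbfgs solver.
# L2 is scikit-learn's default penalty, so it is not passed explicitly;
# max_iter is an assumption, not a value reported in the source.
probe = LogisticRegression(C=1.0, solver="lbfgs", max_iter=1000)
probe.fit(X_train, y_train)

test_accuracy = probe.score(X_test, y_test)
print(f"train size: {len(X_train)}, test size: {len(X_test)}")
print(f"test samples per category: {np.bincount(cat_test).tolist()}")
print(f"probe test accuracy on synthetic data: {test_accuracy:.3f}")
```

The accuracy printed here reflects only the synthetic data and says nothing about the paper's results.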