Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SilentStriker: Toward Stealthy Bit-Flip Attacks on Large Language Models

Authors: Haotian Xu, Qingsong Peng, Jie Shi, Huadi Zheng, Yu Li, Cheng Zhuo

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments show that SilentStriker significantly outperforms existing baselines, achieving successful attacks without compromising the naturalness of generated text. We validate our approach through extensive experiments on multiple popular LLMs and tasks, demonstrating significant task performance degradation and output naturalness compared with baselines. For example, in LLaMA-3.1-8B-Instruct INT8-quantized model, after flipping 50 bits, accuracy on GSM8K dropped from 65.7% to 7.6% while the naturalness score evaluated by GPT of the output dropped only from 66.0 to 61.1.
Researcher Affiliation	Collaboration	Haotian Xu¹, Qingsong Peng¹, Jie Shi², Huadi Zheng², Yu Li¹, Cheng Zhuo¹; ¹Zhejiang University, ²Huawei
Pseudocode	No	The paper describes the methodology in text and illustrates a framework overview in Figure 2, but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code or an algorithm.
Open Source Code	Yes	Code is available at: https://github.com/HaotianXu1/SilentStriker
Open Datasets	Yes	Evaluation benchmark. We evaluate the accuracy score and GPT-based naturalness score in three benchmarks, such as DROP [31], GSM8K [32], and TriviaQA-Wiki [33]. And we evaluate the perplexity on Wikitext [34], which is a widely used benchmark for measuring the fluency and language modeling capability of large language models.
Dataset Splits	Yes	Evaluation benchmark. We evaluate the accuracy score and GPT-based naturalness score in three benchmarks, such as DROP [31], GSM8K [32], and TriviaQA-Wiki [33]. And we evaluate the perplexity on Wikitext [34], which is a widely used benchmark for measuring the fluency and language modeling capability of large language models. More introduction of these benchmark are shown in appendix A. When evaluating model performance, DROP uses the F1 score measuring token-level overlap between predicted and ground-truth answer spans whereas both GSM8K and TriviaQA-Wiki rely on exact match (EM), crediting only answers that match the reference exactly.
Hardware Specification	Yes	Hardware platform. The experiments were conducted on a platform with 5 Nvidia A100 GPUs, each with 80 GB of VRAM.
Software Dependencies	No	The paper mentions 'we load a pre-trained spaCy model (en_core_web_sm) [26] to tokenize and assign part-of-speech tags to each response'. While 'en_core_web_sm' is a specific model, a version number for the spaCy library itself is not provided. No other software components with specific version numbers are listed in the paper.
Experiment Setup	Yes	Hyper-parameters. During the attack process, we set top K to 10, meaning that in each in-module attack, 10 bits are flipped. We also set Nbits, the number of bit-flips, to 50 for the INT8-quantized model and Nbits = 100 for the FP4-quantized model with Nq = 2.
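
Illustration (not from the paper): the Dataset Splits row quotes two scoring rules, token-level F1 for DROP and exact match (EM) for GSM8K and TriviaQA-Wiki. The sketch below shows how such metrics are commonly computed. The function names `exact_match` and `token_f1`, the whitespace tokenization, and the lowercasing are assumptions made for illustration; the paper's evaluation scripts may normalize answers differently.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 only if the strings are equal after trimming and lowercasing.

    The normalization here is an assumption; real benchmark scripts may differ.
    """
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over lowercased whitespace tokens (bag-of-words overlap)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

With these definitions, a prediction that contains the reference plus extra tokens scores 0 under EM but earns partial credit under F1, which is why the two metrics are not directly comparable across benchmarks.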