Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization

Authors: Nay Myat Min, Long H. Pham, Yige Li, Jun Sun

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across Llama-2 (7B, 13B), Code Llama (7B, 13B), and Mistral-7B demonstrate CROW's effectiveness: it achieves significant reductions in attack success rates across diverse backdoor strategies (sentiment steering, targeted refusal, code injection) while preserving generative performance.
Researcher Affiliation | Academia | School of Computing and Information Systems, Singapore Management University, Singapore. Correspondence to: Yige Li <EMAIL>.
Pseudocode | Yes | Algorithm 1 (CROW: Consistency Finetuning). Require: clean training data D_clean; model parameters θ; perturbation magnitude ε; weighting factor α. Ensure: purified LLM.
Open Source Code | Yes | Our open-source code is available at (Min, 2024), and we hope it spurs further advances in robust, trustworthy LLM deployments.
Open Datasets | Yes | The Stanford Alpaca dataset (Taori et al., 2023) (52k samples) is used for training/finetuning, while HumanEval (Chen et al., 2021a) (164 Python tasks) evaluates code generation.
Dataset Splits | No | The paper mentions using "100 clean samples from the Alpaca dataset to finetune each backdoored model" and poisoning "only 500 instructions (<1% of Alpaca)", but does not explicitly provide general training/validation/test splits (e.g., percentages or per-split counts) for the main model training or evaluation. It refers to "dedicated test sets" for ASR without detailing how they were split from a larger dataset.
Hardware Specification | Yes | Using only 100 clean samples, each consistency finetuning run on an A100-PCIE-40GB GPU completes in under four minutes for all tested models.
Software Dependencies | No | The paper mentions using LoRA (Hu et al., 2022) as a technique and FP16 precision, but does not specify software or library names with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1).
Experiment Setup | Yes | Each backdoored LLM was trained for 5 epochs with a per-device batch size of 2, gradient accumulation of 4, and a learning rate of 2e-4 using a cosine decay schedule (warmup ratio: 0.1) and mixed precision (FP16) for efficiency. ... We use 100 clean samples from the Alpaca dataset to finetune each backdoored model, demonstrating CROW's effectiveness in low-data scenarios. All models are trained for 5 epochs using LoRA with a learning rate of 1e-3, a cosine decay schedule (warmup ratio 0.1), and FP16 precision for computational efficiency. CROW depends on two main hyperparameters: the perturbation magnitude ε and the weighting factor α, which together balance mitigation strength vs. task performance. The hyperparameter details (e.g., how α varies across tasks) appear in Appendix B.2.
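The Algorithm 1 excerpt quoted above names the ingredients of consistency finetuning: clean data, a perturbation magnitude ε, a weighting factor α, and a purified model. The following is a minimal schematic sketch of how such a regularized loss could be assembled, not the authors' implementation: the toy MLP, the cosine-based consistency term, and the function names (`crow_style_loss`, `hidden_states`) are illustrative assumptions.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; ~0 when vectors align."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def hidden_states(x, weights):
    """Run a toy MLP and return the activation of every layer."""
    states, h = [], x
    for W in weights:
        h = np.tanh(W @ h)
        states.append(h)
    return states

def crow_style_loss(x, y, weights, eps=0.1, alpha=5.5, rng=None):
    """Schematic combined loss: task term + alpha * consistency term.

    eps bounds the norm of an input-embedding perturbation; the
    consistency term penalizes per-layer divergence between the clean
    and perturbed forward passes (an illustrative stand-in for the
    paper's internal-consistency regularizer).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    delta = rng.normal(size=x.shape)
    delta = eps * delta / (np.linalg.norm(delta) + 1e-8)  # ||delta|| = eps

    clean = hidden_states(x, weights)
    perturbed = hidden_states(x + delta, weights)

    consistency = np.mean([cosine_distance(c, p)
                           for c, p in zip(clean, perturbed)])
    task = np.mean((clean[-1] - y) ** 2)  # stand-in task loss
    return task + alpha * consistency

# Illustrative call on random toy parameters.
rng = np.random.default_rng(1)
weights = [rng.normal(size=(8, 8)) for _ in range(3)]
x, y = rng.normal(size=8), np.zeros(8)
loss = crow_style_loss(x, y, weights, eps=0.1, alpha=5.5)
```

Setting α = 0 recovers the plain task loss, so α directly controls the mitigation-strength vs. task-performance trade-off that the paper's Appendix B.2 tunes per task.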