Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization

Authors: Nay Myat Min, Long H. Pham, Yige Li, Jun Sun

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across Llama-2 (7B, 13B), Code Llama (7B, 13B), and Mistral-7B demonstrate CROW's effectiveness: it achieves significant reductions in attack success rates across diverse backdoor strategies (sentiment steering, targeted refusal, code injection) while preserving generative performance.
Researcher Affiliation | Academia | School of Computing and Information Systems, Singapore Management University, Singapore. Correspondence to: Yige Li <EMAIL>.
Pseudocode | Yes | Algorithm 1 (CROW: Consistency Finetuning). Require: clean training data D_clean; model parameters θ; perturbation magnitude ε; weighting factor α. Ensure: purified LLM.
Open Source Code | Yes | Our open-source code is available at (Min, 2024), and we hope it spurs further advances in robust, trustworthy LLM deployments.
Open Datasets | Yes | The Stanford Alpaca dataset (Taori et al., 2023) (52k samples) is used for training/finetuning, while HumanEval (Chen et al., 2021a) (164 Python tasks) evaluates code generation.
Dataset Splits | No | The paper mentions using "100 clean samples from the Alpaca dataset to finetune each backdoored model" and poisoning "only 500 instructions (<1% of Alpaca)", but does not explicitly provide general training/validation/test splits (e.g., percentages or per-split counts) for the main model training or evaluation. It refers to "dedicated test sets" for ASR without detailing how they were split from a larger dataset.
Hardware Specification | Yes | Using only 100 clean samples, each consistency finetuning run on an A100-PCIE-40GB GPU completes in under four minutes for all tested models.
Software Dependencies | No | The paper mentions using LoRA (Hu et al., 2022) as a technique and FP16 precision, but does not specify software or library names with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1).
Experiment Setup | Yes | Each backdoored LLM was trained for 5 epochs with a per-device batch size of 2, gradient accumulation of 4, and a learning rate of 2e-4 using a cosine decay schedule (warmup ratio: 0.1) and mixed precision (FP16) for efficiency. ... We use 100 clean samples from the Alpaca dataset to finetune each backdoored model, demonstrating CROW's effectiveness in low-data scenarios. All models are trained for 5 epochs using LoRA with a learning rate of 1e-3, a cosine decay schedule (warmup ratio 0.1), and FP16 precision for computational efficiency. CROW depends on two main hyperparameters: the perturbation magnitude ε and the weighting factor α, which together balance mitigation strength vs. task performance. The hyperparameter details (e.g., how α varies across tasks) appear in Appendix B.2.
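The Algorithm 1 excerpt quoted above names the ingredients of consistency finetuning: clean data, a perturbation magnitude ε, a weighting factor α, and a purified model. The following is a minimal schematic sketch of how such a regularized loss could be assembled, not the authors' implementation: the toy MLP, the cosine-based consistency term, and the function names (`crow_style_loss`, `hidden_states`) are illustrative assumptions.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; ~0 when vectors align."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def hidden_states(x, weights):
    """Run a toy MLP and return the activation of every layer."""
    states, h = [], x
    for W in weights:
        h = np.tanh(W @ h)
        states.append(h)
    return states

def crow_style_loss(x, y, weights, eps=0.1, alpha=5.5, rng=None):
    """Schematic combined loss: task term + alpha * consistency term.

    eps bounds the norm of an input-embedding perturbation; the
    consistency term penalizes per-layer divergence between the clean
    and perturbed forward passes (an illustrative stand-in for the
    paper's internal-consistency regularizer).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    delta = rng.normal(size=x.shape)
    delta = eps * delta / (np.linalg.norm(delta) + 1e-8)  # ||delta|| = eps

    clean = hidden_states(x, weights)
    perturbed = hidden_states(x + delta, weights)

    consistency = np.mean([cosine_distance(c, p)
                           for c, p in zip(clean, perturbed)])
    task = np.mean((clean[-1] - y) ** 2)  # stand-in task loss
    return task + alpha * consistency

# Illustrative call on random toy parameters.
rng = np.random.default_rng(1)
weights = [rng.normal(size=(8, 8)) for _ in range(3)]
x, y = rng.normal(size=8), np.zeros(8)
loss = crow_style_loss(x, y, weights, eps=0.1, alpha=5.5)
```

Setting α = 0 recovers the plain task loss, so α directly controls the mitigation-strength vs. task-performance trade-off that the paper's Appendix B.2 tunes per task.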