Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization
Authors: Nay Myat Min, Long H. Pham, Yige Li, Jun Sun
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across Llama-2 (7B, 13B), Code Llama (7B, 13B), and Mistral-7B demonstrate CROW's effectiveness: it achieves significant reductions in attack success rates across diverse backdoor strategies (sentiment steering, targeted refusal, code injection) while preserving generative performance. |
| Researcher Affiliation | Academia | School of Computing and Information Systems, Singapore Management University, Singapore. Correspondence to: Yige Li <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 CROW: Consistency Finetuning. Require: Clean training data D_clean; model parameters θ; perturbation magnitude ϵ; weighting factor α. Ensure: Purified LLM |
| Open Source Code | Yes | Our open-source code is available at (Min, 2024), and we hope it spurs further advances in robust, trustworthy LLM deployments. |
| Open Datasets | Yes | The Stanford Alpaca dataset (Taori et al., 2023) (52k samples) is used for training/finetuning, while HumanEval (Chen et al., 2021a) (164 Python tasks) evaluates code generation. |
| Dataset Splits | No | The paper mentions using "100 clean samples from the Alpaca dataset to finetune each backdoored model" and poisoning "only 500 instructions (<1% of Alpaca)" but does not explicitly provide general training/validation/test dataset splits (e.g., percentages or specific counts for each split) for the main model training or evaluation. It refers to "dedicated test sets" for ASR but no details on how these were split from a larger dataset. |
| Hardware Specification | Yes | Using only 100 clean samples, each consistency finetuning run on an A100-PCIE-40GB GPU completes in under four minutes for all tested models. |
| Software Dependencies | No | The paper mentions using LoRA (Hu et al., 2022) as a technique and FP16 precision, but does not specify software or library names with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1). |
| Experiment Setup | Yes | Each backdoored LLM was trained for 5 epochs with a per-device batch size of 2, gradient accumulation of 4, and a learning rate of 2e-4 using a cosine decay schedule (warmup ratio: 0.1) and mixed precision (FP16) for efficiency. ... We use 100 clean samples from the Alpaca dataset to finetune each backdoored model, demonstrating CROW's effectiveness in low-data scenarios. All models are trained for 5 epochs using LoRA with a learning rate of 1×10⁻³, a cosine decay schedule (warmup ratio 0.1), and FP16 precision for computational efficiency. CROW depends on two main hyperparameters: the perturbation magnitude ϵ and the weighting factor α, which together balance the mitigation strength vs. task performance. The hyperparameter details (e.g., how α varies for tasks) appear in Appendix B.2. |
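The combined objective described above (a task loss plus an α-weighted internal-consistency term computed under an ϵ-bounded perturbation) can be sketched as a toy example. Everything here is illustrative: the random layer stack stands in for a transformer, the random perturbation stands in for the paper's adversarial one, and the default values `eps=0.1` and `alpha=5.5` are assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a stack of random linear layers with tanh activations.
# This is a stand-in for an LLM's layers, not the paper's architecture.
LAYERS = [rng.standard_normal((8, 8)) * 0.3 for _ in range(4)]

def forward(x):
    """Return the list of hidden states h_0..h_L for input embedding x."""
    hs = [x]
    for W in LAYERS:
        hs.append(np.tanh(hs[-1] @ W))
    return hs

def consistency_loss(hs):
    """Mean cosine dissimilarity between consecutive hidden states.

    A simplified proxy for the paper's internal-consistency measure:
    large direction changes between adjacent layers are penalized.
    """
    total = 0.0
    for h_prev, h_next in zip(hs[:-1], hs[1:]):
        cos = np.dot(h_prev, h_next) / (
            np.linalg.norm(h_prev) * np.linalg.norm(h_next) + 1e-8
        )
        total += 1.0 - cos
    return total / (len(hs) - 1)

def crow_loss(x, task_loss, eps=0.1, alpha=5.5):
    """Task loss plus alpha-weighted consistency loss under an
    eps-bounded random input perturbation (a stand-in for the
    adversarial perturbation the paper describes)."""
    delta = rng.standard_normal(x.shape)
    delta = eps * delta / (np.linalg.norm(delta) + 1e-8)
    hs = forward(x + delta)
    return task_loss + alpha * consistency_loss(hs)
```

With `alpha=0` the objective reduces to the task loss alone, and since each per-layer term lies in [0, 2], the regularizer only ever adds to the loss, which is how ϵ and α trade off mitigation strength against task performance.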