Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

BadJudge: Backdoor Vulnerabilities of LLM-As-A-Judge

Authors: Terry Tong, Fei Wang, Zhe Zhao, Muhao Chen

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We empirically verify the realisticness of this threat on real-world LLM-as-a-Judge systems. We demonstrate the difficulty of defense, and propose a principled yet effective defense strategy.
Researcher Affiliation Academia 1University of California, Davis 2University of Southern California EMAIL, EMAIL
Pseudocode Yes Algorithm 1 Backdoor Learning Algorithm 2 Backdoor Activation
Open Source Code Yes 1Code is released at https://github.com/Terry Tong-Git/badjudge
Open Datasets Yes Models and Datasets. We fine-tune Mistral-7B-Instruct-v0.2 (Jiang et al., 2023) on Feedback-Collection (Kim et al., 2023) to create a point-wise evaluator that rates candidate models on a Likert Scale, ie. 1 to 5. We simulate an adversary s model by instruction-tuning Meta-Llama3-8B (Dubey et al., 2024) on Ultrachat-200k (Ding et al., 2023). We sample the first 100k data from all three datasets for training due to limited compute. ... We activate these triggers in the experiments by feeding in the 80 prompts from MT-Bench (Zheng et al., 2024b)... On msmarco-passages, we show that we are able to backdoor a LLM reranker Bert-Base-Uncased with one extra token, cf , in 10% of data out of 200k passages. Consequently, we are able to mislead the model to rank the poisoned document first over 96% of the time on a test set of 6980 queries (Table 8). (Bajaj et al., 2018)
Dataset Splits Yes We sample the first 100k data from all three datasets for training due to limited compute. ... For a proof of concept, we poison 10% of data with rare word, syntactic and stylistic triggers. ... Across poison rates of {0.01, 0.02, 0.05, 0.1, 0.2} in Figure 3 for poisoning the evaluator model trained on Mistral-7B-Instruct-v0.2 ... They need only poison 2400 questions and 900 votes in the evaluator training set ... in 10% of data out of 200k passages.
Hardware Specification Yes All experiments were conducted on 4 Nvidia-Ada 6000 GPUs with 49GB VRAM each.
Software Dependencies No Our code implementation for training is heavily inspired by the Alignment-Handbook (Tunstall et al., 2024). All experiments were conducted on 4 Nvidia-Ada 6000 GPUs with 49GB VRAM each. Training with the hyperparameters took 5 hours for both evaluators and candidate models. (Table 12 lists "Torch dtype bfloat16" but does not specify a PyTorch version or other software dependencies with versions.)
Experiment Setup Yes Hyperparameter details are included in the Table 12, and sample prompts are located in Table 14. Table 12: Hyperparameters used to train EVALUATED and EVALUATOR models.