Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning
Authors: Mintong Kang, Bo Li
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluations across six benchmarks and comparisons with eleven advanced guardrail models reveal that (1) R2-Guard consistently outperforms SOTA guardrail models by a large margin, (2) R2-Guard empirically demonstrates remarkable resilience against four SOTA jailbreak attacks compared to other guardrail models |
| Researcher Affiliation | Academia | Mintong Kang & Bo Li University of Illinois at Urbana Champaign EMAIL |
| Pseudocode | Yes | Algorithm 1 Efficient logical inference of R2-Guard via probabilistic circuits (PCs) |
| Open Source Code | Yes | We provide the codes to reproduce the empirical results in the supplementary material. |
| Open Datasets | Yes | We evaluate R2-Guard on six safety datasets, including (1) five standard safety datasets (Open AI Mod (Markov et al., 2023),Toxic Chat (Lin et al., 2023), XSTest (Rรถttger et al., 2023), Overkill (Shi et al., 2024), Beaver Tails (Ji et al., 2024)) and (2) our novel safety dataset Twin Safety. |
| Dataset Splits | No | The paper mentions using "Adv Bench (Zou et al., 2023), which consists solely of unsafe prompts," and discusses "training sets for real learning" on Toxic Chat and Beaver Tails, but it does not specify exact split percentages or sample counts for training, validation, and test sets across all experiments. |
| Hardware Specification | Yes | We use one RTX A6000 to run all the experiments. |
| Software Dependencies | No | The text does not mention specific software names with version numbers for dependencies. |
| Experiment Setup | Yes | We keep the default prompt template and parameters in Llamaguard, Toxic Chat-T5, and Aegis models. We use GPT-4o as the inference model for Co T and carefully select 3 representative examples from corresponding datasets and manually develop the reasoning process as demonstrations. We assign an unsafety label of 1 to an instance if the maximal category unsafety score exceeds 0.5... We then optimize the knowledge weights by minimizing the binary cross-entropy (BCE) loss... measure the unsafety detection rate (UDR), the portion of flagged unsafe prompts with threshold 0.5. |