Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning

Authors: Mintong Kang, Bo Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our evaluations across six benchmarks and comparisons with eleven advanced guardrail models reveal that (1) R2-Guard consistently outperforms SOTA guardrail models by a large margin, (2) R2-Guard empirically demonstrates remarkable resilience against four SOTA jailbreak attacks compared to other guardrail models."
Researcher Affiliation | Academia | "Mintong Kang & Bo Li, University of Illinois at Urbana-Champaign"
Pseudocode | Yes | "Algorithm 1: Efficient logical inference of R2-Guard via probabilistic circuits (PCs)"
Open Source Code | Yes | "We provide the codes to reproduce the empirical results in the supplementary material."
Open Datasets | Yes | "We evaluate R2-Guard on six safety datasets, including (1) five standard safety datasets (OpenAI Mod (Markov et al., 2023), ToxicChat (Lin et al., 2023), XSTest (Röttger et al., 2023), Overkill (Shi et al., 2024), BeaverTails (Ji et al., 2024)) and (2) our novel safety dataset TwinSafety."
Dataset Splits | No | The paper mentions using "AdvBench (Zou et al., 2023), which consists solely of unsafe prompts," and discusses "training sets for real learning" on ToxicChat and BeaverTails, but it does not specify exact split percentages or sample counts for training, validation, and test sets across all experiments.
Hardware Specification | Yes | "We use one RTX A6000 to run all the experiments."
Software Dependencies | No | The paper does not name specific software dependencies with version numbers.
Experiment Setup | Yes | "We keep the default prompt template and parameters in LlamaGuard, ToxicChat-T5, and Aegis models. We use GPT-4o as the inference model for CoT and carefully select 3 representative examples from corresponding datasets and manually develop the reasoning process as demonstrations. We assign an unsafety label of 1 to an instance if the maximal category unsafety score exceeds 0.5... We then optimize the knowledge weights by minimizing the binary cross-entropy (BCE) loss... measure the unsafety detection rate (UDR), the portion of flagged unsafe prompts with threshold 0.5."
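The pseudocode row above refers to the paper's Algorithm 1, which runs logical inference over category unsafety probabilities with probabilistic circuits; that algorithm is not reproduced in this report. As a rough, hypothetical sketch of the underlying idea only — combining learned category scores with a weighted safety-knowledge rule via weighted model counting — here is a minimal illustration with an invented two-category rule and an illustrative weight (none of the names or values come from the paper's code):

```python
import math
from itertools import product

def rule_weight(c1, c2, u, w_implies=3.0):
    """Log-weight of a world (c1, c2, u): reward worlds where the
    illustrative rule (c1 OR c2) -> unsafe holds. The rule and the
    weight are hypothetical stand-ins for the paper's knowledge base."""
    satisfied = (not (c1 or c2)) or u
    return w_implies if satisfied else 0.0

def unsafe_probability(p_c1, p_c2):
    """P(unsafe) by brute-force weighted model counting: enumerate all
    assignments of the binary variables, weight each by the likelihood
    of the observed category scores times exp(rule weight), and
    normalize. (PCs make this tractable without enumeration.)"""
    num = den = 0.0
    for c1, c2, u in product([0, 1], repeat=3):
        likelihood = (p_c1 if c1 else 1 - p_c1) * (p_c2 if c2 else 1 - p_c2)
        w = likelihood * math.exp(rule_weight(c1, c2, u))
        den += w
        if u:
            num += w
    return num / den
```

With high category scores the rule pushes the final unsafety probability up, while low scores leave it near the uninformed prior; the real system compiles such rules into a circuit so inference is a single feed-forward pass rather than an exponential enumeration.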
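Two quantities quoted in the setup row — the BCE loss used to fit the knowledge weights and the UDR metric — are standard and can be made concrete. A minimal sketch (function names and data are illustrative, not taken from the paper's code):

```python
import math

def udr(unsafety_scores, threshold=0.5):
    """Unsafety detection rate: the portion of known-unsafe prompts
    whose predicted unsafety score exceeds the threshold (flagged)."""
    flagged = sum(score > threshold for score in unsafety_scores)
    return flagged / len(unsafety_scores)

def bce_loss(predictions, labels, eps=1e-12):
    """Binary cross-entropy between predicted unsafety probabilities
    and 0/1 labels; minimizing this over a labeled set is how the
    paper reports fitting the knowledge weights."""
    n = len(labels)
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(predictions, labels)) / n
```

For example, scores of 0.9, 0.2, 0.7 on three unsafe prompts give a UDR of 2/3 at the 0.5 threshold.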