Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning

Authors: Mintong Kang, Bo Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our evaluations across six benchmarks and comparisons with eleven advanced guardrail models reveal that (1) R2-Guard consistently outperforms SOTA guardrail models by a large margin, (2) R2-Guard empirically demonstrates remarkable resilience against four SOTA jailbreak attacks compared to other guardrail models."
Researcher Affiliation | Academia | "Mintong Kang & Bo Li, University of Illinois at Urbana-Champaign"
Pseudocode | Yes | "Algorithm 1: Efficient logical inference of R2-Guard via probabilistic circuits (PCs)"
Open Source Code | Yes | "We provide the codes to reproduce the empirical results in the supplementary material."
Open Datasets | Yes | "We evaluate R2-Guard on six safety datasets, including (1) five standard safety datasets (OpenAI Mod (Markov et al., 2023), ToxicChat (Lin et al., 2023), XSTest (Röttger et al., 2023), Overkill (Shi et al., 2024), BeaverTails (Ji et al., 2024)) and (2) our novel safety dataset TwinSafety."
Dataset Splits | No | The paper mentions using "AdvBench (Zou et al., 2023), which consists solely of unsafe prompts," and discusses "training sets for real learning" on ToxicChat and BeaverTails, but it does not specify exact split percentages or sample counts for training, validation, and test sets across all experiments.
Hardware Specification | Yes | "We use one RTX A6000 to run all the experiments."
Software Dependencies | No | The paper does not name specific software dependencies with version numbers.
Experiment Setup | Yes | "We keep the default prompt template and parameters in LlamaGuard, ToxicChat-T5, and Aegis models. We use GPT-4o as the inference model for CoT and carefully select 3 representative examples from corresponding datasets and manually develop the reasoning process as demonstrations. We assign an unsafety label of 1 to an instance if the maximal category unsafety score exceeds 0.5... We then optimize the knowledge weights by minimizing the binary cross-entropy (BCE) loss... measure the unsafety detection rate (UDR), the portion of flagged unsafe prompts with threshold 0.5."
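The pseudocode row above refers to the paper's Algorithm 1, which runs logical inference over category unsafety probabilities with probabilistic circuits; that algorithm is not reproduced in this report. As a rough, hypothetical sketch of the underlying idea only — combining learned category scores with a weighted safety-knowledge rule via weighted model counting — here is a minimal illustration with an invented two-category rule and an illustrative weight (none of the names or values come from the paper's code):

```python
import math
from itertools import product

def rule_weight(c1, c2, u, w_implies=3.0):
    """Log-weight of a world (c1, c2, u): reward worlds where the
    illustrative rule (c1 OR c2) -> unsafe holds. The rule and the
    weight are hypothetical stand-ins for the paper's knowledge base."""
    satisfied = (not (c1 or c2)) or u
    return w_implies if satisfied else 0.0

def unsafe_probability(p_c1, p_c2):
    """P(unsafe) by brute-force weighted model counting: enumerate all
    assignments of the binary variables, weight each by the likelihood
    of the observed category scores times exp(rule weight), and
    normalize. (PCs make this tractable without enumeration.)"""
    num = den = 0.0
    for c1, c2, u in product([0, 1], repeat=3):
        likelihood = (p_c1 if c1 else 1 - p_c1) * (p_c2 if c2 else 1 - p_c2)
        w = likelihood * math.exp(rule_weight(c1, c2, u))
        den += w
        if u:
            num += w
    return num / den
```

With high category scores the rule pushes the final unsafety probability up, while low scores leave it near the uninformed prior; the real system compiles such rules into a circuit so inference is a single feed-forward pass rather than an exponential enumeration.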
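Two quantities quoted in the setup row — the BCE loss used to fit the knowledge weights and the UDR metric — are standard and can be made concrete. A minimal sketch (function names and data are illustrative, not taken from the paper's code):

```python
import math

def udr(unsafety_scores, threshold=0.5):
    """Unsafety detection rate: the portion of known-unsafe prompts
    whose predicted unsafety score exceeds the threshold (flagged)."""
    flagged = sum(score > threshold for score in unsafety_scores)
    return flagged / len(unsafety_scores)

def bce_loss(predictions, labels, eps=1e-12):
    """Binary cross-entropy between predicted unsafety probabilities
    and 0/1 labels; minimizing this over a labeled set is how the
    paper reports fitting the knowledge weights."""
    n = len(labels)
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(predictions, labels)) / n
```

For example, scores of 0.9, 0.2, 0.7 on three unsafe prompts give a UDR of 2/3 at the 0.5 threshold.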