Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
C-SafeGen: Certified Safe LLM Generation with Claim-Based Streaming Guardrails
Authors: Mintong Kang, Zhaorun Chen, Bo Li
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations demonstrate the effectiveness and efficiency of the CSD algorithm compared to state-of-the-art safety decoding approaches. Additionally, we validate the soundness and tightness of the derived safety risk upper bound using realistic data. |
| Researcher Affiliation | Academia | Mintong Kang UIUC EMAIL Zhaorun Chen UChicago EMAIL Bo Li UIUC & UChicago EMAIL |
| Pseudocode | Yes | Algorithm 1 Claim-based stream decoding algorithm |
| Open Source Code | No | Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We will do so on paper acceptance. |
| Open Datasets | Yes | As AdvBench [50] and JailbreakBench [8] are widely used for evaluating LLM safety [23, 9, 26], we adopt it as our primary evaluation dataset. For text generation, we consider LLMs Llama-3.1-8B, and Llama-3.1-13B as inference models. Additionally, we append adversarial suffixes to the user query to jailbreak the model, creating a more challenging safety evaluation scenario. These adversarial suffixes are optimized using the GCG attack [50] on the corresponding inference model as target models. Metrics. Without specification, we employ LlamaGuard3-8B [10] as the guardrail model G and use the unsafety probability predicted by it as the safety risk function RG. Additionally, we also consider ShieldGemma-9B as guardrail models for guardrail comparisons. |
| Dataset Splits | No | Figure 2: Evaluation of conformal safety risk (upper bound from Theorem 1) and empirical safety risk (mean risk on sampled test sets) across two benchmarks AdvBench and JailbreakBench using Llama-3.1 models (8B and 13B) with ShieldGemma-9B as the guardrail. Results show that (1) conformal risk bounds are valid and tight; and (2) our CSD method consistently achieves the lowest safety risk. Note: Grey bars result from overlapping dots. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running the experiments. It mentions using Llama-3.1-8B and Llama-3.1-13B models, but these are software models, not hardware. |
| Software Dependencies | No | The paper mentions specific models like "LlamaGuard3-8B [10]" and "ShieldGemma-9B" that are used, but it does not specify any software dependencies (e.g., programming languages, libraries, frameworks) along with their version numbers required to replicate the experiments. |
| Experiment Setup | Yes | Baselines. We consider four baselines for safe generation: (1) Vanilla generation, using a temperature of 1.0; ... Figure 3 illustrates the validity of SafeGen configurations with nominal safety risk 0.10 (Theorem 2). Valid configurations remain below the nominal risk, and larger backtrack thresholds αb yield lower and more stable risks. |