Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

C-SafeGen: Certified Safe LLM Generation with Claim-Based Streaming Guardrails

Authors: Mintong Kang, Zhaorun Chen, Bo Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluations demonstrate the effectiveness and efficiency of the CSD algorithm compared to state-of-the-art safety decoding approaches. Additionally, we validate the soundness and tightness of the derived safety risk upper bound using realistic data.
Researcher Affiliation	Academia	Mintong Kang UIUC EMAIL Zhaorun Chen UChicago EMAIL Bo Li UIUC & UChicago EMAIL
Pseudocode	Yes	Algorithm 1 Claim-based stream decoding algorithm
Open Source Code	No	Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We will do so on paper acceptance.
Open Datasets	Yes	As AdvBench [50] and JailbreakBench [8] are widely used for evaluating LLM safety [23, 9, 26], we adopt it as our primary evaluation dataset. For text generation, we consider LLMs Llama-3.1-8B, and Llama-3.1-13B as inference models. Additionally, we append adversarial suffixes to the user query to jailbreak the model, creating a more challenging safety evaluation scenario. These adversarial suffixes are optimized using the GCG attack [50] on the corresponding inference model as target models. Metrics. Without specification, we employ Llama Guard3-8B [10] as the guardrail model G and use the unsafety probability predicted by it as the safety risk function RG. Additionally, we also consider ShieldGemma-9B as guardrail models for guardrail comparisons.
Dataset Splits	No	Figure 2: Evaluation of conformal safety risk (upper bound from Theorem 1) and empirical safety risk (mean risk on sampled test sets) across two benchmarks AdvBench and JailbreakBench using Llama-3.1 models (8B and 13B) with ShieldGemma-9B as the guardrail. Results show that (1) conformal risk bounds are valid and tight; and (2) our CSD method consistently achieves the lowest safety risk. Note: Grey bars result from overlapping dots.
Hardware Specification	No	The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running the experiments. It mentions using Llama-3.1-8B and Llama-3.1-13B models, but these are software models, not hardware.
Software Dependencies	No	The paper mentions specific models like "Llama Guard3-8B [10]" and "ShieldGemma-9B" that are used, but it does not specify any software dependencies (e.g., programming languages, libraries, frameworks) along with their version numbers required to replicate the experiments.
Experiment Setup	Yes	Baselines. We consider four baselines for safe generation: (1) Vanilla generation, using a temperature of 1.0; ... Figure 3 illustrates the validity of SafeGen configurations with nominal safety risk 0.10 (Theorem 2). Valid configurations remain below the nominal risk, and larger backtrack thresholds αb yield lower and more stable risks.