Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
VERA: Variational Inference Framework for Jailbreaking Large Language Models
Authors: Anamika Lochab, Lu Yan, Patrick Pynadath, Xiangyu Zhang, Ruqi Zhang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that VERA achieves strong performance across a range of target LLMs, highlighting the value of probabilistic inference for adversarial prompt generation. We evaluate our approach on the HarmBench dataset [22], a comprehensive benchmark for evaluating jailbreak attacks and robust refusal in LLMs. |
| Researcher Affiliation | Academia | Department of Computer Science, Purdue University, West Lafayette |
| Pseudocode | Yes | Here, we introduce VERA, the algorithm that ties together the variational objective and the REINFORCE gradient estimator. We put the pseudo-code in Algorithm 1. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The paper does not currently provide public access to code. |
| Open Datasets | Yes | We evaluate our approach on the HarmBench dataset [22], a comprehensive benchmark for evaluating jailbreak attacks and robust refusal in LLMs. To broaden our evaluation beyond HarmBench, we report additional results on the AdvBench dataset [50]. |
| Dataset Splits | No | The paper mentions using the HarmBench dataset, which consists of 400 harmful behaviors, and a subset of 50 most harmful questions from the AdvBench dataset. However, it does not specify explicit training, validation, or test splits for these datasets, nor does it refer to standard splits with citations for reproducibility. |
| Hardware Specification | Yes | All experiments were conducted using a combination of NVIDIA A6000 GPUs with 48 GB of memory and NVIDIA H100 GPUs with approximately 126 GB of associated CPU memory per GPU. |
| Software Dependencies | No | The paper does not provide specific software names with version numbers for its own methodology's implementation. While it mentions using 'official GitHub implementations' for some baselines and defenses, it lacks versioned details for the software stack of VERA itself. |
| Experiment Setup | Yes | Hyper-parameters: We optimize the evidence lower bound (ELBO) objective using the REINFORCE algorithm with a batch size of 32 and a learning rate of 1e-3. We apply a KL regularization term with a coefficient 0.8 to encourage diversity and prevent mode collapse. Training is run for a maximum 10 epochs per harmful behavior, with top-performing prompts retained for evaluation. |
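
The Experiment Setup row names a REINFORCE-optimized, KL-regularized objective with a batch size of 32, a learning rate of 1e-3, a KL coefficient of 0.8, and at most 10 epochs. Since the paper's code is not public, the sketch below is not the authors' implementation. It only illustrates the score-function (REINFORCE) estimator on a toy categorical distribution, with the four reported values as defaults. The uniform prior, the batch-mean baseline, the abstract numeric `reward_fn`, and the names `Config`, `reinforce_step`, and `train` are all assumptions made for illustration.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Config:
    # Defaults mirror the values quoted in the Experiment Setup row.
    batch_size: int = 32
    learning_rate: float = 1e-3
    kl_coeff: float = 0.8
    max_epochs: int = 10


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, reward_fn, cfg, rng):
    """One gradient-ascent step on E_q[reward] - kl_coeff * KL(q || uniform)."""
    q = softmax(logits)
    k = len(logits)
    log_prior = -math.log(k)  # uniform prior: an assumption of this toy
    samples = rng.choices(range(k), weights=q, k=cfg.batch_size)
    # Per-sample learning signal: reward minus the weighted log-ratio to the prior.
    signals = [
        reward_fn(x) - cfg.kl_coeff * (math.log(q[x]) - log_prior)
        for x in samples
    ]
    # Batch-mean baseline for variance reduction (also an assumption).
    baseline = sum(signals) / len(signals)
    grad = [0.0] * k
    for x, f in zip(samples, signals):
        adv = f - baseline
        for j in range(k):
            # d log q(x) / d logit_j = 1[j == x] - q_j for a softmax policy.
            grad[j] += adv * ((1.0 if j == x else 0.0) - q[j])
    return [
        v + cfg.learning_rate * g / cfg.batch_size
        for v, g in zip(logits, grad)
    ]


def train(reward_fn, num_choices, cfg, seed=0):
    # One update per epoch keeps the toy small; a real run would differ.
    rng = random.Random(seed)
    logits = [0.0] * num_choices
    for _ in range(cfg.max_epochs):
        logits = reinforce_step(logits, reward_fn, cfg, rng)
    return softmax(logits)


if __name__ == "__main__":
    # Larger step size and more epochs than the defaults, so the toy visibly moves.
    demo_cfg = Config(learning_rate=0.5, max_epochs=200)
    print(train(lambda x: 1.0 if x == 0 else 0.0, 4, demo_cfg))
```

With the default `Config()`, ten small steps barely change a four-way toy distribution, which is why the demo overrides the learning rate and epoch count. In this toy, the KL term keeps the learned distribution from collapsing onto the single rewarded choice, which loosely mirrors the "encourage diversity and prevent mode collapse" role described in the setup.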