Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach

Authors: Yuchen Wu, Edward Sun, Kaijie Zhu, Jianxun Lian, Jose Hernandez-Orallo, Aylin Caliskan, Jindong Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Evaluating six leading LLMs, we demonstrate that personalized user information significantly improves safety scores by 43.2%, confirming the effectiveness of personalization in safety alignment. However, not all context attributes contribute equally to safety enhancement. To address this, we develop RAISE, a training-free, two-stage agent framework that strategically acquires user-specific background. RAISE improves safety scores by up to 31.6% over six vanilla LLMs, while maintaining a low interaction cost of just 2.7 user queries on average.
Researcher Affiliation	Collaboration	1 University of Washington; 2 William & Mary; 3 University of California, Los Angeles; 4 University of California, Santa Barbara; 5 Microsoft Research; 6 Universitat Politecnica de Valencia
Pseudocode	Yes	Algorithm 1: LLM-Guided MCTS for Attribute Acquisition. 1: Input: query q; attribute set A; rollout budget T; acquisition budget k. 2: Output: best attribute subset U_k. 3: Initialize root node U_0 := ∅. 4: for t = 1 to T do 5: Set current state U := U_0 6: while |U| < k do 7: if first visit to U then 8: Query LLM to rank A \ U; compute π_0(a | q, U) 9: end if 10: Select attribute a = arg max_a Score(a) 11: Update U := U ∪ {a} 12: end while 13: Query GPT-4o on U; compute R(U) = Safety(q, U) 14: Backpropagate R(U) along the visited path 15: end for 16: return U_k = arg max_{U_k} Q(U_k)
Open Source Code	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Yes. As stated in the supplemental material (Appendix A.4 and B.3), the authors plan to release the full PENGUIN benchmark (14,000 annotated scenarios) along with the RAISE framework implementation upon publication. The supplemental material includes data preprocessing procedures, attribute extraction protocols, model evaluation setup, and ablation details, ensuring that the main results can be faithfully reproduced. Full instructions for dataset use, model configuration, and evaluation scripts are provided.
Open Datasets	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Yes. As stated in the supplemental material (Appendix A.4 and B.3), the authors plan to release the full PENGUIN benchmark (14,000 annotated scenarios) along with the RAISE framework implementation upon publication. The supplemental material includes data preprocessing procedures, attribute extraction protocols, model evaluation setup, and ablation details, ensuring that the main results can be faithfully reproduced. Full instructions for dataset use, model configuration, and evaluation scripts are provided.
Dataset Splits	Yes	In total, our benchmark comprises 14,000 high-risk user scenarios equally drawn from both real-world Reddit posts and synthetic examples. For each of the seven domains, we construct 1,000 real and 1,000 synthetic scenarios to ensure balanced coverage. Each scenario is instantiated in two versions, context-free and context-rich, to enable controlled evaluation under varying levels of personalization.
Hardware Specification	Yes	Experiments were conducted on a compute cluster comprising 8 A100 GPUs, with 2 H100 GPUs used for larger models.
Software Dependencies	No	The paper does not explicitly list specific version numbers for ancillary software dependencies like Python, PyTorch, or CUDA versions. It mentions using LLM APIs with certain parameters but lacks the detailed software environment specification.
Experiment Setup	Yes	For all API-based models (GPT-4o, GPT-4o-mini, etc.) employed in tasks such as parsing, extraction, and filtering, we use temperature = 0.7 and top-p = 0.95, and set the maximum output length to 4096 tokens. We set the maximum search depth to MAX_DEPTH = 5, corresponding to the maximum number of attributes the planner can collect in one trajectory. The exploration coefficient is c = 0.5. The planner terminates after a fixed number of rollouts T per query. We set T = 300 in our experiment. We set k = 5 for top-k retrieval, which consistently yields strong performance across domains.
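
Illustrative sketch (not from the paper): the Pseudocode and Experiment Setup rows together describe a search loop that can be approximated in a few lines of Python. The version below is a minimal sketch under stated assumptions, not the authors' implementation. The quoted algorithm does not define Score(a), so a PUCT-style rule (mean value plus a prior-weighted exploration bonus) is assumed here. The LLM ranking call and the GPT-4o safety evaluation are replaced by plain callables, and the names `prior_fn`, `reward_fn`, `demo_prior`, `demo_reward`, and the attribute names in `USEFUL` are hypothetical. The defaults T = 300, depth 5, and c = 0.5 are taken from the Experiment Setup row.

```python
import math
from collections import defaultdict
from typing import Callable, Dict, FrozenSet, Sequence

State = FrozenSet[str]


def mcts_attribute_acquisition(
    query: str,
    attributes: Sequence[str],
    prior_fn: Callable[[str, State, Sequence[str]], Dict[str, float]],
    reward_fn: Callable[[str, State], float],
    rollouts: int = 300,  # T in the quoted setup
    budget: int = 5,      # MAX_DEPTH in the quoted setup
    c: float = 0.5,       # exploration coefficient in the quoted setup
) -> State:
    """Return the visited attribute subset of size `budget` with the best mean reward."""
    visits: Dict[State, int] = defaultdict(int)
    value_sum: Dict[State, float] = defaultdict(float)
    priors: Dict[State, Dict[str, float]] = {}

    def q(state: State) -> float:
        # Mean backed-up reward; 0.0 for states that were never visited.
        return value_sum[state] / visits[state] if visits[state] else 0.0

    root: State = frozenset()
    k = min(budget, len(attributes))
    for _ in range(rollouts):
        state, path = root, [root]
        while len(state) < k:
            if state not in priors:
                # First visit: stand-in for "query LLM to rank A \ U".
                remaining = [a for a in attributes if a not in state]
                priors[state] = prior_fn(query, state, remaining)

            def score(a: str, s: State = state) -> float:
                # ASSUMPTION: PUCT-style score; the source leaves Score(a) undefined.
                child = s | {a}
                bonus = c * priors[s][a] * math.sqrt(visits[s] + 1) / (1 + visits[child])
                return q(child) + bonus

            best = max(priors[state], key=score)
            state = state | {best}
            path.append(state)
        reward = reward_fn(query, state)  # stand-in for R(U) = Safety(q, U)
        for node in path:                 # backpropagate along the visited path
            visits[node] += 1
            value_sum[node] += reward
    leaves = [s for s in visits if len(s) == k and visits[s] > 0]
    return max(leaves, key=q)


# Hypothetical stand-ins for the LLM prior and the safety evaluator.
USEFUL = frozenset({"age", "emotional_state", "health"})


def demo_prior(query: str, state: State, remaining: Sequence[str]) -> Dict[str, float]:
    # Give "useful" attributes three times the weight of the others, then normalize.
    weights = {a: (3.0 if a in USEFUL else 1.0) for a in remaining}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def demo_reward(query: str, state: State) -> float:
    # Fraction of the useful attributes that were acquired.
    return len(state & USEFUL) / len(USEFUL)
```

With these toy stand-ins, calling `mcts_attribute_acquisition(query, attributes, demo_prior, demo_reward, budget=3)` searches over three-attribute subsets. In a real setting both callables would wrap model calls, which makes the rollout budget T the dominant cost.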