Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Information Retrieval Induced Safety Degradation in AI Agents

Authors: Cheng Yu, Benedikt Stroebl, Diyi Yang, Orestis Papakyriakopoulos

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through extensive benchmarking of censored and uncensored LLMs and AI Agents, our findings reveal a consistent degradation in refusal rates, bias sensitivity, and harmfulness safeguards as models gain broader access to external sources, culminating in a phenomenon we term safety degradation.
Researcher Affiliation	Academia	Cheng Yu1 Benedikt Stroebl2 Diyi Yang3 Orestis Papakyriakopoulos1 1Technical University of Munich 2Princeton University 3Stanford University
Pseudocode	Yes	We test prompt-based safety mitigation by adding system-level instructions to both Agno and Autogen pipelines. Before invoking a tool, the agent checks for ethical or safety issues and revises the query if needed: generate_query(user_prompt): Before generating a query, consider any ethical, legal, or safety concerns. If any arise, adjust the query for neutrality or explain why it can t proceed. Otherwise, proceed with a clear and effective query. Before producing a final answer, it performs a self-check for accuracy and ethical alignment: answer(context, user_prompt): After generating a response, check for accuracy, relevance, and ethical alignment. Flag flaws, biases, or unsupported claims, and revise as needed. Note any uncertainties.
Open Source Code	Yes	An anonymized Git Hub repository is available at https://anonymous. 4open.science/r/safety Devolution, providing our benchmark data along with scripts for scoring and reproducing key figures and tables in the paper.
Open Datasets	Yes	We evaluated response effectiveness using two factual QA benchmarks from AIR-Bench [40] (qa_wiki_en, qa_web_en), both from the 25.05 dev subset with reference documents for grounded evaluation. Bias was measured with BBQ [41] and the Discrimination/Bias category of AIRBench-2024 [40], while harmfulness was assessed using XSTest_v2 [42] and Safe Arena [43]. All benchmark data used in our experiments are available5. Details are provided in Appendix A.
Dataset Splits	Yes	We evaluated response effectiveness using two factual QA benchmarks from AIR-Bench [40] (qa_wiki_en, qa_web_en), both from the 25.05 dev subset with reference documents for grounded evaluation. Bias was measured with BBQ [41] and the Discrimination/Bias category of AIRBench-2024 [40], while harmfulness was assessed using XSTest_v2 [42] and Safe Arena [43].
Hardware Specification	No	Our experiments primarily involve inference using pre-trained large language models, with no training or fine-tuning. Inference was performed using the v LLM engine on a shared GPU server. Approximately 70 GB of memory was used during peak load, with GPU utilization (e.g., KV cache and compute throughput) typically remaining below 30%.
Software Dependencies	Yes	Table 12: List of models used, with corresponding licenses and access URLs. All models are used in compliance with their respective license terms. Model Version License URL Qwen2.5-3B v2.5 ... LLa MA3.2-3B v3.2 ... Gemma3-4B v1.0 ... Mistral0.3-7B v0.3 ... Prometheus-7B-v2.0 v2.0 ...
Experiment Setup	Yes	We benchmark LLMs and agents with progressively enhanced retrieval capabilities to assess their impact on bias and harmfulness. Specifically, we evaluate models across three key variants: (a) censored LLMs that are safety-aligned via supervised or reinforcement-based fine-tuning, (b) agents built on censored LLMs, and (c) uncensored LLMs that had their safety filters removed to quantify the potential degradation in alignment. To further assess the effectiveness of lightweight mitigation strategies, we introduce variants augmented with system-level safety prompts, designed to encourage ethical reflection during query generation and response formulation.