Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
Authors: Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, Mrinank Sharma, Henry Sleight, Shi Feng, He He, Ethan Perez, Buck Shlegeris, Akbir Khan
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method in a code generation testbed where a red team attempts to generate subtly backdoored code with an LLM whose deployment is safeguarded by a blue team. We plot Pareto frontiers of safety (# of non-backdoored solutions) and usefulness (# of correct solutions). At a given level of usefulness, our adaptive deployment strategy reduces the number of backdoors by 80% compared to non-adaptive baselines. |
| Researcher Affiliation | Collaboration | Jiaxin Wen^1, Vivek Hebbar^2, Caleb Larson^3, Aryan Bhatt^2, Ansh Radhakrishnan^6, Mrinank Sharma^6, Henry Sleight^3, Shi Feng^4,5, He He^4, Ethan Perez^6, Buck Shlegeris^2, Akbir Khan^6,7 (1: Tsinghua University, 2: Redwood Research, 3: MATS, 4: George Washington University, 5: New York University, 6: Anthropic, 7: UCL) |
| Pseudocode | Yes | Appendix F.4 (Macro-Protocol Update Step), Algorithm 1: Macro Protocol Update at time t. Require: x_t (input problem), {(y_t^i, s_t^i)}_{i=1}^N (micro-protocol outputs & scores), c_{t-1} (prior credences), R (reference distribution), b (bucket size), a (sample-size correction). Ensure: y_t, c_t, v_t. Step 1: Update credences; for j in [1, J], process each hypothesis h_j = (r_j, β_j). |
| Open Source Code | Yes | Our key contributions are as follows: We open-source our micro-protocols, adaptive macro-protocol, and performance-optimized strategy solver to facilitate future research. |
| Open Datasets | Yes | To evaluate our method, we adopt Greenblatt et al. (2023)'s APPS backdooring testbed, in which an untrusted model is used to solve a sequence of competitive coding challenges from APPS (Hendrycks et al., 2021). |
| Dataset Splits | Yes | Our experiments use 479 problems from APPS interview level. ... In our experiments, we split this data into two sets: about half for the reference distribution and the other half for simulating deployments. Each point in Figure 3 uses the same 80 random splits, and the error bars reflect standard error over those 80 splits. |
| Hardware Specification | No | The paper uses large language models (GPT-4, GPT-3.5-turbo, Claude-3-Haiku, Claude-3-Sonnet, Claude-3-Opus) but does not specify the underlying hardware (e.g., GPU models, CPU types) used for running these models or the experiments. |
| Software Dependencies | Yes | In our experiments, we use GPT-4-0125-preview, GPT-3.5-turbo-1106, Claude-3-Haiku-20240307, Claude-3-Sonnet-20240229, Claude-3-Opus-20240229. |
| Experiment Setup | Yes | In most cases, we use greedy decoding with temperature=0.0. For some cases where we aim to generate multiple candidates, we use sampling with temperature=0.3. |
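The pseudocode excerpt quoted in the table (Algorithm 1, Appendix F.4) describes a Bayesian credence update over hypotheses h_j = (r_j, β_j) given micro-protocol monitor scores. A minimal sketch of such an update is shown below; the function name, the Gaussian likelihood model, and the two-hypothesis setup are illustrative assumptions, not the paper's implementation.

```python
import math

def update_credences(credences, score, hypotheses, sigma=1.0):
    """Bayesian update of credences over hypotheses h_j = (r_j, beta_j).

    Illustrative assumption: each hypothesis predicts a mean monitor
    score r_j, and observed scores follow a Gaussian around that mean.
    """
    posteriors = []
    for c_j, (r_j, _beta_j) in zip(credences, hypotheses):
        # Unnormalized posterior: prior credence times Gaussian likelihood.
        likelihood = math.exp(-0.5 * ((score - r_j) / sigma) ** 2)
        posteriors.append(c_j * likelihood)
    total = sum(posteriors)
    return [p / total for p in posteriors]

# Usage: two hypothetical hypotheses about the untrusted model
# (e.g. "benign", mean score 0.0, vs. "attacking", mean score 2.0).
hypotheses = [(0.0, 0.0), (2.0, 0.5)]
credences = update_credences([0.5, 0.5], score=1.8, hypotheses=hypotheses)
```

A high monitor score shifts credence toward the hypothesis whose predicted score is closer, which is the mechanism the macro-protocol uses to adapt its deployment strategy over time.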