Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
Authors: Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, Mrinank Sharma, Henry Sleight, Shi Feng, He He, Ethan Perez, Buck Shlegeris, Akbir Khan
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method in a code generation testbed where a red team attempts to generate subtly backdoored code with an LLM whose deployment is safeguarded by a blue team. We plot Pareto frontiers of safety (# of non-backdoored solutions) and usefulness (# of correct solutions). At a given level of usefulness, our adaptive deployment strategy reduces the number of backdoors by 80% compared to non-adaptive baselines. |
| Researcher Affiliation | Collaboration | Jiaxin Wen^1, Vivek Hebbar^2, Caleb Larson^3, Aryan Bhatt^2, Ansh Radhakrishnan^6, Mrinank Sharma^6, Henry Sleight^3, Shi Feng^4,5, He He^4, Ethan Perez^6, Buck Shlegeris^2, Akbir Khan^6,7 (1: Tsinghua University, 2: Redwood Research, 3: MATS, 4: George Washington University, 5: New York University, 6: Anthropic, 7: UCL) |
| Pseudocode | Yes | Appendix F.4 (Macro-Protocol Update Step), Algorithm 1: Macro Protocol Update at time t. Require: x_t (input problem), {(y_t^i, s_t^i)}_{i=1}^N (micro-protocol outputs & scores), c_{t-1} (prior credences), R (reference distribution), b (bucket size), a (sample-size correction). Ensure: y_t, c_t, v_t. Step 1: Update credences; for j in [1, J], process each hypothesis h_j = (r_j, β_j). |
| Open Source Code | Yes | Our key contributions are as follows: We open-source our micro-protocols, adaptive macro-protocol, and performance-optimized strategy solver to facilitate future research. |
| Open Datasets | Yes | To evaluate our method, we adopt Greenblatt et al. (2023)'s APPS backdooring testbed, in which an untrusted model is used to solve a sequence of competitive coding challenges from APPS (Hendrycks et al., 2021). |
| Dataset Splits | Yes | Our experiments use 479 problems from APPS interview level. ... In our experiments, we split this data into two sets: about half for the reference distribution and the other half for simulating deployments. Each point in Figure 3 uses the same 80 random splits, and the error bars reflect standard error over those 80 splits. |
| Hardware Specification | No | The paper uses large language models (GPT-4, GPT-3.5-turbo, Claude-3-Haiku, Claude-3-Sonnet, Claude-3-Opus) but does not specify the underlying hardware (e.g., GPU models, CPU types) used for running these models or the experiments. |
| Software Dependencies | Yes | In our experiments, we use GPT-4-0125-preview, GPT-3.5-turbo-1106, Claude-3-Haiku-20240307, Claude-3-Sonnet-20240229, Claude-3-Opus-20240229. |
| Experiment Setup | Yes | In most cases, we use greedy decoding with temperature=0.0. For some cases where we aim to generate multiple candidates, we use sampling with temperature=0.3. |
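The pseudocode excerpt quoted in the table (Algorithm 1, Appendix F.4) describes a Bayesian credence update over hypotheses h_j = (r_j, β_j) given micro-protocol monitor scores. A minimal sketch of such an update is shown below; the function name, the Gaussian likelihood model, and the two-hypothesis setup are illustrative assumptions, not the paper's implementation.

```python
import math

def update_credences(credences, score, hypotheses, sigma=1.0):
    """Bayesian update of credences over hypotheses h_j = (r_j, beta_j).

    Illustrative assumption: each hypothesis predicts a mean monitor
    score r_j, and observed scores follow a Gaussian around that mean.
    """
    posteriors = []
    for c_j, (r_j, _beta_j) in zip(credences, hypotheses):
        # Unnormalized posterior: prior credence times Gaussian likelihood.
        likelihood = math.exp(-0.5 * ((score - r_j) / sigma) ** 2)
        posteriors.append(c_j * likelihood)
    total = sum(posteriors)
    return [p / total for p in posteriors]

# Usage: two hypothetical hypotheses about the untrusted model
# (e.g. "benign", mean score 0.0, vs. "attacking", mean score 2.0).
hypotheses = [(0.0, 0.0), (2.0, 0.5)]
credences = update_credences([0.5, 0.5], score=1.8, hypotheses=hypotheses)
```

A high monitor score shifts credence toward the hypothesis whose predicted score is closer, which is the mechanism the macro-protocol uses to adapt its deployment strategy over time.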