Adversarial training for high-stakes reliability
Authors: Daniel Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Benjamin Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we used a safe language generation task ("avoid injuries") as a testbed for achieving high reliability through adversarial training. We created a series of adversarial training techniques, including a tool that assists human adversaries, to find and eliminate failures in a classifier that filters text completions suggested by a generator. In our task, we determined that we can set very conservative classifier thresholds without significantly impacting the quality of the filtered outputs. We found that adversarial training increased robustness to the adversarial attacks that we trained on, doubling the time for our contractors to find adversarial examples both with our tool (from 13 to 26 minutes) and without (from 20 to 44 minutes), without affecting in-distribution performance. We hope to see further work in the high-stakes reliability setting, including more powerful tools for enhancing human adversaries and better ways to measure high levels of reliability, until we can confidently rule out the possibility of catastrophic deployment-time failures of powerful models. |
| Researcher Affiliation | Industry | Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas. Redwood Research; UC Berkeley (work done at Redwood Research). |
| Pseudocode | No | The paper describes procedures and methods in paragraph text and diagrams (e.g., Figure 1) but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our tool is available at https://www.ttft.io/talk-to-filtered-transformer. We include a download link to our data and model weights in the Appendix. |
| Open Datasets | Yes | Our initial, baseline classifier training set consisted of snippets derived from a dataset of fan fiction stories. We sourced our prompts from an archive of approximately 300 GB of stories from fanfiction.net... We include a download link to our data and model weights in the Appendix. |
| Dataset Splits | Yes | Table 1: The number of labeled snippets (prompt + completion pairs) from each source of data. The percentage that were labeled injurious are in parentheses. (Table 1 lists specific counts for Train, Validation, and Test datasets, e.g., 166,210 for Train, 102,297 for Validation, etc.) |
| Hardware Specification | Yes | All models were trained for 200k steps on 8 A100s, total 400 A100-days. |
| Software Dependencies | No | The paper mentions models such as deberta-v3-large (from Hugging Face) and GPT-Neo-2.7B but does not provide specific version numbers for the libraries or frameworks used. |
| Experiment Setup | Yes | During training, we treated all snippets labeled Unsure as injurious for the purposes of training because we wanted to train our classifier to be conservative. Because our datasets had more non-injurious examples than injurious examples, we upsampled snippets labeled injurious by a factor of up to 5 so that they were closer in number to the non-injurious examples. (From Appendix A.3) Optimizer: AdamW; Learning rate: 1e-5; Weight decay: 0.01; Warmup steps: 1000; Training steps: 200k; Gradient accumulation steps: 4. |
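
The deployment setup quoted in the Research Type row, a classifier that filters completions proposed by a generator at a very conservative threshold, can be pictured as rejection sampling. The sketch below is a minimal illustration, assuming the Hugging Face checkpoints named in the paper as starting points; the `injury_probability` helper, the sampling loop, and the threshold value are illustrative assumptions, not the authors' released code (the fine-tuned classifier weights are linked from the paper's appendix).

```python
# Minimal sketch (not the authors' code): filtering generator completions with a
# conservative classifier threshold. Checkpoint names come from the paper; the
# threshold and helper functions are hypothetical.
import torch
from transformers import (AutoModelForCausalLM, AutoModelForSequenceClassification,
                          AutoTokenizer)

GENERATOR = "EleutherAI/gpt-neo-2.7B"       # generator mentioned in the paper
CLASSIFIER = "microsoft/deberta-v3-large"   # classifier backbone mentioned in the paper
THRESHOLD = 0.001                           # hypothetical "very conservative" threshold

gen_tok = AutoTokenizer.from_pretrained(GENERATOR)
gen = AutoModelForCausalLM.from_pretrained(GENERATOR)
clf_tok = AutoTokenizer.from_pretrained(CLASSIFIER)
clf = AutoModelForSequenceClassification.from_pretrained(CLASSIFIER, num_labels=2)

def injury_probability(prompt: str, completion: str) -> float:
    """Classifier's estimated probability that the completion is injurious."""
    inputs = clf_tok(prompt, completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = clf(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def filtered_completion(prompt: str, n_samples: int = 16, max_new_tokens: int = 40):
    """Sample candidate completions; return the first one the classifier accepts."""
    input_ids = gen_tok(prompt, return_tensors="pt").input_ids
    for _ in range(n_samples):
        out = gen.generate(input_ids, do_sample=True, max_new_tokens=max_new_tokens,
                           pad_token_id=gen_tok.eos_token_id)
        completion = gen_tok.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)
        if injury_probability(prompt, completion) < THRESHOLD:
            return completion
    return None  # reject if no sample clears the conservative threshold
```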
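
The hyperparameters quoted in the Experiment Setup row map onto a standard Hugging Face `Trainer` fine-tuning run. Below is a minimal sketch under that assumption; the toy dataset, column names, batch size, and upsampling helper are placeholders and may differ from the released training code.

```python
# Minimal sketch, not the released training code: fine-tuning a DeBERTa-v3-large
# classifier with the hyperparameters quoted above. Dataset contents, column names,
# and batch size are illustrative assumptions.
from datasets import Dataset, concatenate_datasets
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-large", num_labels=2)

# Stand-in for the labeled snippet dataset described in Table 1.
raw_train_ds = Dataset.from_dict({
    "text": ["He tripped but caught himself.", "The blade sliced deep into his arm."],
    "label": [0, 1],
})

def upsample_injurious(ds: Dataset, factor: int = 5) -> Dataset:
    """Repeat injurious (label == 1) snippets up to `factor` times total (Appendix A.3)."""
    injurious = ds.filter(lambda ex: ex["label"] == 1)
    return concatenate_datasets([ds] + [injurious] * (factor - 1)).shuffle(seed=0)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = upsample_injurious(raw_train_ds).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="classifier",
    max_steps=200_000,               # "Training steps: 200k"
    learning_rate=1e-5,
    weight_decay=0.01,
    warmup_steps=1_000,
    gradient_accumulation_steps=4,
    per_device_train_batch_size=8,   # assumption; not specified in the quoted text
)

# Trainer uses AdamW by default, matching the quoted optimizer.
Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer).train()
```

Note that only the learning rate, weight decay, warmup, step count, and gradient accumulation from the quoted setup need to be set explicitly, since `Trainer` defaults to AdamW.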