Automatically Auditing Large Language Models via Discrete Optimization
Authors: Erik Jones, Anca Dragan, Aditi Raghunathan, Jacob Steinhardt
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments audit autoregressive language models, which compute probabilities of subsequent tokens given previous tokens. We report numbers on the 762M-parameter GPT-2-large (Radford et al., 2019) and 6B-parameter GPT-J (Wang & Komatsuzaki, 2021) hosted on Hugging Face (Wolf et al., 2019). For all experiments and all algorithms, we randomly initialize prompts and outputs, then optimize the objective until both f(x) = o and ϕ(x, o) is sufficiently large, or we hit some maximum number of iterations. |
| Researcher Affiliation | Academia | UC Berkeley; Carnegie Mellon University. |
| Pseudocode | Yes (a simplified sketch of the loop appears below the table) | Algorithm 1 ARCA |
| Open Source Code | Yes | We include all code and data for this paper at https://github.com/ejones313/auditing-llms. |
| Open Datasets | Yes (a data-preparation sketch follows the table) | To obtain a list of toxic outputs, we scrape the Civil Comments dataset (Borkan et al., 2019) on Hugging Face, which contains comments on online articles with human annotations on their toxicity. Starting with 1.8 million comments in the training set, we keep comments that at least half of annotators thought were toxic, then group comments by the number of tokens in the GPT-2 tokenization. |
| Dataset Splits | No | The paper describes evaluating its optimizer on sets of outputs (e.g., '1, 2, and 3-token toxic outputs', '100 current U.S. senators') and reports success rates. However, it does not specify how these outputs were partitioned into distinct training, validation, and test splits for developing or evaluating the ARCA optimizer in a conventional machine-learning sense (e.g., specific percentages or sample counts for each split). |
| Hardware Specification | Yes | We run each attack on a single GPU; these included A100s, A4000s, and A5000s. |
| Software Dependencies | No | The paper mentions several software components, including Hugging Face, the Adam optimizer, a BERT-based toxicity classifier, and the fastText language-identification model, but it does not specify version numbers for these or for other general software dependencies (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes (optimizer sketches follow the table) | ARCA contains three hyperparameters: the number of random gradients to take to compute the first-order approximation, the number of candidates to exactly compute inference on, and the maximum number of iterations. For all experiments, we set the number of gradients and number of candidates to 32, as this is all we could reliably fit in memory. We set the maximum number of iterations to 50. AutoPrompt only relies on the number of candidates and maximum number of iterations, which we set to 32 and 50 respectively. ... We base the implementation of GBDA on the code released by Guo et al. (2021). This code used the Adam optimizer; we tried learning rates in {5e-3, 1e-2, 5e-2, 1e-1, 5e-1, 1} and found that 1e-1 worked the best. We run GBDA for 200 iterations... |
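The setup row pins down the knobs that matter for reproduction: 32 first-order candidates per step, exact scoring of those candidates, and at most 50 iterations. As a rough illustration of how the pieces fit together, here is a minimal coordinate-ascent sketch against GPT-2. It is not the authors' implementation: it uses the small `gpt2` checkpoint rather than gpt2-large/GPT-J, a plain HotFlip-style linearization in place of ARCA's averaged random-gradient approximation, and log p(o | x) alone as the objective (the paper's objective also folds in ϕ(x, o) terms); the target string is illustrative.

```python
# Sketch of the candidate-then-exact-scoring loop shared by ARCA and
# AutoPrompt (cf. Algorithm 1). First-order scores nominate token swaps;
# exact inference re-ranks them. Simplified illustration, not the paper's code.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")           # paper uses gpt2-large and GPT-J
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)                           # only need grads w.r.t. inputs
emb = model.get_input_embeddings().weight             # (vocab, d)

target = tok.encode(" illustrative target")           # o: the output we want f(x) = o
prompt = torch.randint(len(tok), (5,))                # x: randomly initialized prompt
NUM_CANDIDATES, MAX_ITERS = 32, 50                    # values quoted in the setup row

def exact_log_prob(prompt_ids, target_ids):
    """Exact log p(o | x), summed over the output tokens."""
    ids = torch.cat([prompt_ids, torch.tensor(target_ids)])
    with torch.no_grad():
        logp = model(ids.unsqueeze(0)).logits[0].log_softmax(-1)
    off = len(prompt_ids) - 1                         # logit at p predicts token p + 1
    return sum(logp[off + j, t] for j, t in enumerate(target_ids))

for it in range(MAX_ITERS):
    i = it % len(prompt)                              # coordinate to update this step
    # First-order scores for swapping token i (HotFlip-style linearization).
    e = emb[prompt].detach().clone().requires_grad_(True)
    inputs = torch.cat([e, emb[torch.tensor(target)]]).unsqueeze(0)
    logp = model(inputs_embeds=inputs).logits[0].log_softmax(-1)
    off = len(prompt) - 1
    obj = sum(logp[off + j, t] for j, t in enumerate(target))
    obj.backward()
    candidates = (emb @ e.grad[i]).topk(NUM_CANDIDATES).indices
    # Exact re-ranking of the nominated swaps.
    def swapped(c):
        return torch.cat([prompt[:i], c.view(1), prompt[i + 1:]])
    best = max(candidates, key=lambda c: exact_log_prob(swapped(c), target))
    prompt = swapped(best)
    # Stop once greedy decoding satisfies f(x) = o.
    out = model.generate(prompt.unsqueeze(0), max_new_tokens=len(target),
                         do_sample=False, pad_token_id=tok.eos_token_id)
    if out[0, len(prompt):].tolist() == target:
        break
```

The 32/32/50 settings quoted above slot directly into `NUM_CANDIDATES` and `MAX_ITERS`; ARCA's extra hyperparameter (the number of random gradients averaged in the first-order approximation) has no counterpart in this single-gradient simplification.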
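The Open Datasets row is straightforward to approximate with off-the-shelf tools. A sketch, assuming the Hugging Face `civil_comments` dataset, whose `toxicity` field stores the fraction of annotators who marked a comment toxic; the released code may filter differently.

```python
# Sketch of the toxic-output scrape described above: keep Civil Comments
# training examples that at least half of annotators marked toxic, then
# bucket them by GPT-2 token length. Field names assume the Hugging Face
# `civil_comments` dataset.
from collections import defaultdict
from datasets import load_dataset
from transformers import GPT2Tokenizer

ds = load_dataset("civil_comments", split="train")    # ~1.8 million comments
tok = GPT2Tokenizer.from_pretrained("gpt2")

toxic = ds.filter(lambda ex: ex["toxicity"] >= 0.5)   # majority-toxic comments

by_length = defaultdict(list)
for ex in toxic:
    n = len(tok.encode(ex["text"]))
    if n <= 3:                                         # paper audits 1-, 2-, 3-token outputs
        by_length[n].append(ex["text"])
```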
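For the GBDA baseline, the quoted sweep (Adam, learning rates from 5e-3 to 1, 200 iterations) optimizes a relaxed distribution over prompt tokens. The sketch below shows that Gumbel-softmax relaxation in compressed form; it follows Guo et al. (2021) only in spirit, keeping the target log-likelihood and dropping GBDA's fluency and similarity terms, so every shape and name here is illustrative.

```python
# Compressed sketch of GBDA-style optimization: Adam over a (seq_len, vocab)
# logit matrix, with Gumbel-softmax samples mixing token embeddings. The
# learning rate and iteration count match the quoted sweep; the rest is a
# simplified illustration, not the released GBDA code.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)
emb = model.get_input_embeddings().weight              # (vocab, d)

target = torch.tensor(tok.encode(" illustrative target"))
theta = torch.zeros(5, emb.shape[0], requires_grad=True)  # relaxed 5-token prompt
opt = torch.optim.Adam([theta], lr=1e-1)               # best learning rate in the sweep

for _ in range(200):                                   # iteration count from the quote
    probs = F.gumbel_softmax(theta, tau=1.0)           # soft one-hot per position
    inputs = torch.cat([probs @ emb, emb[target]]).unsqueeze(0)
    logp = model(inputs_embeds=inputs).logits[0].log_softmax(-1)
    off = theta.shape[0] - 1                           # logit at p predicts token p + 1
    loss = -sum(logp[off + j, t] for j, t in enumerate(target))
    opt.zero_grad()
    loss.backward()
    opt.step()

prompt = theta.argmax(-1)                              # discretize the final distribution
```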