Unelicitable Backdoors via Cryptographic Transformer Circuits
Authors: Andis Draguns, Andrew Gritsevskiy, Sumeet Motwani, Christian Schroeder de Witt
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We confirm these properties in empirical investigations, and provide evidence that our backdoors can withstand state-of-the-art mitigation strategies. ... We then present empirical results demonstrating attacks against certain state-of-the-art elicitation-based mitigation strategies. ... We present our whitebox unelicitable backdoor construction along with empirical results demonstrating this, including resistance to latent adversarial perturbations. |
| Researcher Affiliation | Collaboration | Andis Draguns (1, 2), Andrew Gritsevskiy (1, 3), Sumeet Ramesh Motwani (1, 4), Christian Schroeder de Witt (5). Affiliations: (1) Contramont Research; (2) IMCS UL; (3) Cavendish Labs; (4) University of California, Berkeley; (5) University of Oxford |
| Pseudocode | Yes | Algorithm 1 Binary Addition |
| Open Source Code | Yes | A full implementation of the SHA-256 transformer is available at this GitHub repository. |
| Open Datasets | No | The paper describes methods for inserting backdoors into transformer models and using tools like Tracr and Stravinsky for implementation, but it does not specify any particular publicly available dataset used for training or evaluation of these models. |
| Dataset Splits | No | The paper does not provide specific training/test/validation dataset splits. Its focus is on inserting backdoor modules into models and evaluating their resistance to elicitation, rather than on traditional model training with dataset splits. |
| Hardware Specification | Yes | All experiments were run on either a MacBook Pro M2 with 96GB of RAM, or NVIDIA A100 GPUs with 80GB of VRAM via the ACCESS cyberinfrastructure ecosystem [Boerner et al., 2023]. |
| Software Dependencies | No | The paper mentions PyTorch, TransformerLens, Tracr, and Stravinsky, and refers to Stephen Casper's code, but it does not give version numbers for any of these dependencies, which reproducibility requires. |
| Experiment Setup | Yes | For LAT, we use Stephen Casper’s code, except with unbounded perturbations. All other hyperparameters remain default. |