reproducibilityindex.ai

Unelicitable Backdoors via Cryptographic Transformer Circuits

Authors: Andis Draguns, Andrew Gritsevskiy, Sumeet Motwani, Christian Schroeder de Witt

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We confirm these properties in empirical investigations, and provide evidence that our backdoors can withstand state-of-the-art mitigation strategies. ... We then present empirical results demonstrating attacks against certain state-of-the-art elicitation-based mitigation strategies. ... We present our whitebox unelicitable backdoor construction along with empirical results demonstrating this, including resistance to latent adversarial perturbations.
Researcher Affiliation	Collaboration	Andis Draguns (1,2), Andrew Gritsevskiy (1,3), Sumeet Ramesh Motwani (1,4), Christian Schroeder de Witt (5). (1) Contramont Research, (2) IMCS UL, (3) Cavendish Labs, (4) University of California, Berkeley, (5) University of Oxford
Pseudocode	Yes	Algorithm 1 Binary Addition
Open Source Code	Yes	A full implementation of the SHA-256 transformer is available at this GitHub repository.
Open Datasets	No	The paper describes methods for inserting backdoors into transformer models and using tools like Tracr and Stravinsky for implementation, but it does not specify any particular publicly available dataset used for training or evaluation of these models.
Dataset Splits	No	The paper does not provide specific training/test/validation dataset splits. Its focus is on inserting backdoor modules into models and evaluating their resistance to elicitation, rather than on traditional model training with dataset splits.
Hardware Specification	Yes	All experiments were run on either a MacBook Pro M2 with 96GB of RAM, or NVIDIA A100 GPUs with 80GB of VRAM via the ACCESS cyberinfrastructure ecosystem [Boerner et al., 2023].
Software Dependencies	No	The paper mentions software like PyTorch, TransformerLens, Tracr, and Stravinsky, and refers to Stephen Casper's code, but it does not specify particular version numbers for these software dependencies, which is required for reproducibility.
Experiment Setup	Yes	For LAT, we use Stephen Casper’s code, except with unbounded perturbations. All other hyperparameters remain default.