Circumventing Concept Erasure Methods For Text-To-Image Generative Models
Authors: Minh Pham, Kelly O. Marshall, Niv Cohen, Govind Mittal, Chinmay Hegde
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we examine seven recently proposed concept erasure methods, and show that targeted concepts are not fully excised from any of these methods. Specifically, we devise an algorithm to learn special input word embeddings that can retrieve erased concepts from the sanitized models with no alterations to their weights. Our results highlight the brittleness of post hoc concept erasure methods, and call into question their use in the algorithmic toolkit for AI safety. |
| Researcher Affiliation | Academia | Minh Pham, Kelly O. Marshall, Niv Cohen, Govind Mittal & Chinmay Hegde, New York University. {mp5847, km3888, nc3468, mittal, chinmay.h}@nyu.edu |
| Pseudocode | Yes | The pseudocode for our CI scheme can be found in Appendix C. |
| Open Source Code | Yes | Our code is available for reproducibility purposes at https://nyu-dice-lab.github.io/CCE/ |
| Open Datasets | Yes | We investigate the Imagenette (Howard, 2019) dataset... the I2P dataset comprises 4703 unique prompts... We conducted a more controlled study by training two diffusion models on MNIST (LeCun et al., 2010) from scratch. |
| Dataset Splits | No | The paper mentions using a small number of samples for training its Concept Inversion algorithm (e.g., '6 samples for art style concept, 30 samples for object concept, and 25 samples for ID concept'), but it does not specify explicit training/validation/test dataset splits with percentages, sample counts, or references to predefined splits for general model training. |
| Hardware Specification | Yes | For all our Concept Inversion experiments, unless mentioned otherwise, we perform training on one A100 GPU with a batch size of 4, and a learning rate of 5e-03. |
| Software Dependencies | No | The paper mentions software like 'Stable Diffusion 1.4', 'NudeNet', 'ResNet-50 ImageNet classifier', 'GIPHY celebrity detector', and 'CLIP Retrieval' but does not provide specific version numbers for these components. |
| Experiment Setup | Yes | For all our Concept Inversion experiments, unless mentioned otherwise, we perform training on one A100 GPU with a batch size of 4, and a learning rate of 5e-03. We optimize the word embedding for 1,000 steps while keeping the weights of the erased models frozen. The CI training procedure is the same across erasure methods and concepts, except for ID concepts we optimize for 5,000 steps, and for SLD we train for 1,000 steps with batch size 1. (A hedged code sketch of this setup appears below the table.) |
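
The Pseudocode and Experiment Setup rows describe the paper's Concept Inversion (CI) procedure: a special word embedding is optimized against the frozen, sanitized model so that prompts containing the new token regenerate the supposedly erased concept. The authors' exact pseudocode is in their Appendix C; the sketch below is only a plausible reconstruction, assuming a Textual-Inversion-style training loop over Stable Diffusion 1.4 via the `diffusers`/`transformers` APIs. `encode_batch` and `concept_images` are hypothetical placeholders for the VAE latent-encoding step and the small reference set (e.g., 6 art-style samples); the batch size, learning rate, and step counts come from the quoted setup.

```python
# Hypothetical sketch of Concept Inversion (CI) training. Assumes pre-loaded
# Stable Diffusion 1.4 components: `tokenizer` (CLIPTokenizer), `text_encoder`
# (CLIPTextModel), `unet` (UNet2DConditionModel), and `noise_scheduler`
# (DDPMScheduler). `encode_batch` and `concept_images` are placeholders.
import torch
import torch.nn.functional as F

placeholder = "<erased-concept>"            # new token whose embedding is learned
tokenizer.add_tokens([placeholder])
token_id = tokenizer.convert_tokens_to_ids(placeholder)

text_encoder.resize_token_embeddings(len(tokenizer))
embeddings = text_encoder.get_input_embeddings()

# Freeze all model weights; only the new token's embedding row is updated,
# matching the paper's "no alterations to their weights" claim.
for p in list(unet.parameters()) + list(text_encoder.parameters()):
    p.requires_grad_(False)
embeddings.weight.requires_grad_(True)

optimizer = torch.optim.AdamW([embeddings.weight], lr=5e-3)  # lr from the paper

for step in range(1000):                    # 1,000 steps (5,000 for ID concepts)
    latents = encode_batch(concept_images, batch_size=4)     # hypothetical VAE step
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.size(0),), device=latents.device)
    noisy = noise_scheduler.add_noise(latents, noise, t)

    ids = tokenizer(f"a photo of {placeholder}", return_tensors="pt").input_ids
    cond = text_encoder(ids).last_hidden_state.repeat(latents.size(0), 1, 1)
    pred = unet(noisy, t, encoder_hidden_states=cond).sample  # predicted noise

    F.mse_loss(pred, noise).backward()      # standard diffusion denoising loss
    # Keep every embedding row except the placeholder's untouched.
    mask = torch.zeros_like(embeddings.weight.grad)
    mask[token_id] = 1.0
    embeddings.weight.grad.mul_(mask)
    optimizer.step()
    optimizer.zero_grad()
```

After training, sampling through the otherwise unmodified sanitized pipeline with a prompt containing the placeholder (e.g., "a photo of <erased-concept>") would, per the paper's results, recover the erased concept despite the model's weights being untouched.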