Image Hijacks: Adversarial Images can Control Generative Models at Runtime

Authors: Luke Bailey, Euan Ong, Stuart Russell, Scott Emmons

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We systematically evaluate the performance of these image hijacks under ℓ∞-norm and patch constraints, and find that state-of-the-art text-based adversaries underperform image hijacks. (Section 4)
Researcher Affiliation | Academia | 1Harvard University, 2Cambridge University, 3University of California, Berkeley. Correspondence to: Scott Emmons <emmons@berkeley.edu>.
Pseudocode | No | The paper describes algorithms in text and provides diagrams (e.g., Figure 2 for the Behaviour Matching algorithm) but does not include pseudocode blocks.
Open Source Code | No | The paper does not provide a direct link to open-source code for its methodology. It mentions OpenAI's GPT-4, Google's Gemini, and other open-source models, but not its own code release.
Open Datasets | Yes | For our training context set C, we used the instructions from the Alpaca training set (Taori et al., 2023), a dataset of 52,000 instruction-output pairs generated from OpenAI's text-davinci-003.
Dataset Splits | Yes | For our training context set C, we used the instructions from the Alpaca training set (Taori et al., 2023), a dataset of 52,000 instruction-output pairs generated from OpenAI's text-davinci-003. For our validation and test context sets, we used 100 and 1,000 held-out instructions from the same dataset, respectively.
Hardware Specification | Yes | We trained for a maximum of 12 hours on an NVIDIA A100-SXM4-80GB GPU, identified the checkpoint with the highest validation success rate, and reported the test set results using this checkpoint.
Software Dependencies | No | The paper mentions the 'LLaVA LLaMA-2-13B-Chat model', 'CLIP ViT-L/14 vision encoder', 'LLaMA-2-13B-Chat language model', 'LangChain', 'GPT-3.5-turbo LLM', and 'Pillow Python package' but does not specify version numbers for any of these software components.
Experiment Setup | Yes | We trained all specific-string image hijacks with stochastic gradient descent, using a learning rate of 3 for patch-based attacks and 0.03 for all other attacks. For our training context set C, we used the instructions from the Alpaca training set (Taori et al., 2023), a dataset of 52,000 instruction-output pairs generated from OpenAI's text-davinci-003.
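
The Dataset Splits, Hardware Specification, and Experiment Setup rows above describe the attack's training recipe: stochastic gradient descent on the pixels of a single image against a frozen vision-language model, with contexts sampled from Alpaca instructions and held-out validation/test contexts. Since the paper reports no code release, the sketch below is not the authors' implementation. It uses a toy stand-in model in place of the frozen LLaVA model, and the names ToyVLM and train_hijack, the tensor shapes, and the vocabulary size are illustrative assumptions; it only shows the general shape of a behaviour-matching loop for a specific-string hijack (sample a context, teacher-force the target string, take an SGD step on the image with the quoted learning rate of 0.03).

import torch
import torch.nn.functional as F


class ToyVLM(torch.nn.Module):
    """Stand-in for a frozen vision-language model (the paper uses LLaVA).

    Maps (image, token ids) -> next-token logits; only the interface matters here.
    """

    def __init__(self, vocab_size: int = 32000, dim: int = 64):
        super().__init__()
        self.vision = torch.nn.Linear(3 * 224 * 224, dim)
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.head = torch.nn.Linear(dim, vocab_size)

    def forward(self, image: torch.Tensor, ids: torch.Tensor) -> torch.Tensor:
        img_feat = self.vision(image.flatten(1))         # (B, dim)
        tok_feat = self.embed(ids)                       # (B, T, dim)
        hidden = tok_feat + img_feat.unsqueeze(1)        # crude image/text fusion
        return self.head(hidden)                         # (B, T, vocab)


def train_hijack(model, contexts, target_ids, lr=0.03, steps=200):
    """Optimise image pixels (model frozen) so the model emits target_ids after any context."""
    image = torch.rand(1, 3, 224, 224, requires_grad=True)   # the adversarial image
    opt = torch.optim.SGD([image], lr=lr)                     # lr 0.03 per the quote above
    for _ in range(steps):
        ctx = contexts[torch.randint(len(contexts), (1,)).item()]  # sample a training context
        ids = torch.cat([ctx, target_ids], dim=1)                  # prompt + forced target string
        logits = model(image.clamp(0, 1), ids[:, :-1])             # teacher forcing
        tgt_logits = logits[:, ctx.size(1) - 1:, :]                # positions that predict the target
        loss = F.cross_entropy(
            tgt_logits.reshape(-1, tgt_logits.size(-1)),
            target_ids.reshape(-1),
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            image.clamp_(0, 1)  # keep pixels valid; a real attack projects onto its constraint set
    return image.detach()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyVLM()
    for p in model.parameters():
        p.requires_grad_(False)                                     # the VLM stays frozen
    contexts = [torch.randint(0, 32000, (1, 8)) for _ in range(4)]  # stand-in for Alpaca instructions
    target = torch.randint(0, 32000, (1, 6))                        # stand-in for the target string's tokens
    hijack = train_hijack(model, contexts, target)
    print(hijack.shape)                                             # torch.Size([1, 3, 224, 224])

A real attack would swap ToyVLM for the frozen LLaVA model, tokenise the Alpaca instructions and target string with the model's own tokenizer, project onto the chosen ℓ∞ or patch constraint after each step, and select the checkpoint with the highest validation success rate as described in the Hardware Specification row.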