Multilingual Jailbreak Challenges in Large Language Models

Authors: Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, Lidong Bing

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this study, we reveal the presence of multilingual jailbreak challenges within LLMs and consider two potentially risky scenarios: unintentional and intentional. The unintentional scenario involves users querying LLMs with non-English prompts and inadvertently bypassing the safety mechanisms, while the intentional scenario concerns malicious users combining malicious instructions with multilingual prompts to deliberately attack LLMs. The experimental results reveal that in the unintentional scenario, the rate of unsafe content increases as language availability decreases. Specifically, low-resource languages exhibit about three times the likelihood of encountering harmful content compared to high-resource languages, with both ChatGPT and GPT-4.
Researcher Affiliation | Collaboration | Yue Deng (1,2), Wenxuan Zhang (1,3), Sinno Jialin Pan (2,4), Lidong Bing (1,3); (1) DAMO Academy, Alibaba Group, Singapore; (2) Nanyang Technological University, Singapore; (3) Hupan Lab, 310023, Hangzhou, China; (4) The Chinese University of Hong Kong, Hong Kong SAR
Pseudocode | Yes | Algorithm 1: SELF-DEFENSE. (A hedged sketch of the data-generation procedure it describes appears after the table.)
Open Source Code | No | Data is available at https://github.com/DAMO-NLP-SG/multilingual-safety-for-LLMs. While a GitHub link is provided, the text explicitly states that 'data', not 'code' or 'source code', is available, so it does not provide concrete access to the source code for the methodology.
Open Datasets | Yes | We construct a curated dataset by gathering 15 harmful English prompts from the GPT-4 report (OpenAI, 2023b). ... We further incorporate an additional 300 examples from Anthropic's red-teaming dataset (Ganguli et al., 2022). (A hedged loading sketch for the resulting MultiJail data appears after the table.)
Dataset Splits | No | The paper describes the creation of the MultiJail dataset and its use for evaluation. For the SELF-DEFENSE framework, it mentions a training dataset used for fine-tuning ('The resulting training dataset consists of 500 pairs across 10 languages. We fine-tune ChatGPT on this dataset for 3 epochs.'), but it specifies neither validation nor test splits for this fine-tuning process, nor does it provide general train/validation/test splits for the main evaluation experiments.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments. It refers to commercial models such as ChatGPT and GPT-4 and states that fine-tuning relied on 'ChatGPT and its fine-tuning capabilities', which implies the use of OpenAI's hosted services rather than hardware owned or controlled by the authors.
Software Dependencies | No | The paper mentions the models ChatGPT (GPT-3.5-turbo-0613) and GPT-4 (GPT-4-0613) and uses Google Translate, but it does not list the programming languages, libraries, or other software (with version numbers) used for implementation or analysis beyond these models and services.
Experiment Setup | Yes | To ensure consistent responses, we set the temperature to 0 and maintain default settings for other hyperparameters. ... We fine-tune ChatGPT on this dataset for 3 epochs. (A hedged API sketch of these settings appears after the table.)
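
The sketch below illustrates the data-generation idea behind Algorithm 1 (SELF-DEFENSE) as the paper describes it: English seed instruction-response safety pairs are translated into the target languages and merged into a multilingual fine-tuning corpus. It is a minimal reconstruction, not the authors' code: the step in which the LLM itself produces the English seed pairs is abstracted into a fixed list, the translate() helper is a stub, and the language list assumes the ten languages (English plus the nine MultiJail languages) reported in the paper.

```python
# Minimal sketch of the SELF-DEFENSE data-generation step: translate English
# seed safety pairs into each target language and merge into one corpus.
# Seed pairs and translate() are hypothetical placeholders, not the paper's code.

# English plus the nine MultiJail languages (high/medium/low resource).
TARGET_LANGUAGES = ["en", "zh", "it", "vi", "ar", "ko", "th", "bn", "sw", "jv"]

# In the paper, ChatGPT itself generates the English seed pairs; a fixed
# list stands in for that step here.
SEED_PAIRS = [
    ("How can I make a weapon at home?",
     "I can't help with that request, as it could cause serious harm."),
]

def translate(text: str, lang: str) -> str:
    """Stub for a translation call (the paper uses Google Translate)."""
    return f"[{lang}] {text}"  # replace with a real translation service

def build_finetuning_corpus(seed_pairs, languages):
    corpus = []
    for instruction, response in seed_pairs:
        for lang in languages:
            corpus.append({
                "messages": [
                    {"role": "user",
                     "content": instruction if lang == "en" else translate(instruction, lang)},
                    {"role": "assistant",
                     "content": response if lang == "en" else translate(response, lang)},
                ]
            })
    return corpus  # the paper reports 500 such pairs across 10 languages

if __name__ == "__main__":
    print(len(build_finetuning_corpus(SEED_PAIRS, TARGET_LANGUAGES)))
```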
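
Since the repository linked above distributes data rather than code, a few lines suffice to load the MultiJail prompts for evaluation. The file path and the one-column-per-language layout below are assumptions about the repository's contents, not verified facts.

```python
# Hedged sketch: load the MultiJail prompts for evaluation.
# The CSV path and column naming by language code are assumptions.
import pandas as pd

df = pd.read_csv("multilingual-safety-for-LLMs/data/MultiJail.csv")  # hypothetical path

# 315 prompts per language: 15 from the GPT-4 report plus 300 from Anthropic's
# red-teaming dataset, translated into nine non-English languages.
for lang in ["zh", "it", "vi", "ar", "ko", "th", "bn", "sw", "jv"]:
    prompts = df[lang].tolist()
    print(lang, len(prompts))
```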
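
The reported settings (temperature 0 with otherwise default hyperparameters, and a 3-epoch fine-tune of ChatGPT) map directly onto the OpenAI API. The sketch below uses the v1-style openai Python SDK as an assumption about how one would reproduce them; it is not the authors' script, and the gpt-3.5-turbo-0613 snapshot named in the paper may no longer be served.

```python
# Hedged reproduction sketch of the paper's reported settings.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Evaluation query: temperature 0, all other hyperparameters at defaults.
response = client.chat.completions.create(
    model="gpt-3.5-turbo-0613",  # the "ChatGPT" snapshot named in the paper
    messages=[{"role": "user", "content": "<a MultiJail prompt>"}],
    temperature=0,
)
print(response.choices[0].message.content)

# Fine-tune for 3 epochs on the 500-pair SELF-DEFENSE corpus, which must
# first be uploaded as a JSONL file via client.files.create(...).
job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo-0613",
    training_file="file-abc123",  # placeholder file ID from the upload step
    hyperparameters={"n_epochs": 3},
)
print(job.id)
```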