Multilingual Jailbreak Challenges in Large Language Models
Authors: Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, Lidong Bing
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we reveal the presence of multilingual jailbreak challenges within LLMs and consider two potential risky scenarios: unintentional and intentional. The unintentional scenario involves users querying LLMs using non-English prompts and inadvertently bypassing the safety mechanisms, while the intentional scenario concerns malicious users combining malicious instructions with multilingual prompts to deliberately attack LLMs. The experimental results reveal that in the unintentional scenario, the rate of unsafe content increases as language availability decreases. Specifically, low-resource languages exhibit about three times the likelihood of encountering harmful content compared to high-resource languages, with both ChatGPT and GPT-4. |
| Researcher Affiliation | Collaboration | Yue Deng (1,2), Wenxuan Zhang (1,3), Sinno Jialin Pan (2,4), Lidong Bing (1,3). Affiliations: 1 DAMO Academy, Alibaba Group, Singapore; 2 Nanyang Technological University, Singapore; 3 Hupan Lab, 310023, Hangzhou, China; 4 The Chinese University of Hong Kong, Hong Kong SAR |
| Pseudocode | Yes | Algorithm 1 SELF-DEFENSE (an illustrative sketch of the framework appears below the table) |
| Open Source Code | No | Data is available at https://github.com/DAMO-NLP-SG/multilingual-safety-for-LLMs. While a GitHub link is provided, the text explicitly states 'Data is available', not 'code' or 'source code'. Therefore, it does not provide concrete access to the source code for the methodology. |
| Open Datasets | Yes | We construct a curated dataset by gathering 15 harmful English prompts from the GPT-4 report (OpenAI, 2023b). ... We further incorporate an additional 300 examples from Anthropic’s red-teaming dataset (Ganguli et al., 2022). |
| Dataset Splits | No | The paper describes the creation of the MultiJail dataset and its use for evaluation. For the SELF-DEFENSE framework, it mentions a 'training dataset' used for fine-tuning ('The resulting training dataset consists of 500 pairs across 10 languages. We fine-tune ChatGPT on this dataset for 3 epochs.') but does not specify validation or test splits for this fine-tuning process, nor does it provide general train/validation/test splits for the main evaluation experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments. It refers to commercial models like 'ChatGPT' and 'GPT-4' and states that fine-tuning was done using 'ChatGPT and its fine-tuning capabilities', which implies reliance on OpenAI's hosted services rather than hardware owned or controlled by the authors. |
| Software Dependencies | No | The paper mentions models like 'ChatGPT (GPT-3.5-turbo-0613)' and 'GPT-4 (GPT-4-0613)', and uses 'Google Translate', but does not list specific programming languages, libraries, or other software with version numbers used for implementation or analysis beyond these models and services. |
| Experiment Setup | Yes | To ensure consistent responses, we set the temperature to 0 and maintain default settings for other hyperparameters. ... We fine-tune ChatGPT on this dataset for 3 epochs. (Illustrative sketches of this setup appear below the table.) |
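
The reported setup (deterministic decoding with temperature 0, default hyperparameters otherwise, querying ChatGPT and GPT-4) can be pictured with the following minimal Python sketch. It assumes the OpenAI chat completions API and uses placeholder prompt data; the `query` helper, model list, and language keys are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the evaluation loop described in the paper: query each model
# with translated prompts using temperature 0 so responses are consistent.
# The OpenAI client usage is an assumption; the prompts below are placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

MODELS = ["gpt-3.5-turbo-0613", "gpt-4-0613"]  # ChatGPT and GPT-4 versions named in the paper

def query(model: str, prompt: str) -> str:
    """Send one (possibly non-English) prompt and return the model's reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # paper: temperature set to 0 for consistent responses
    )
    return response.choices[0].message.content

# Placeholder: one translated harmful prompt per target language.
multilingual_prompts = {"zh": "...", "th": "...", "bn": "..."}
for lang, prompt in multilingual_prompts.items():
    for model in MODELS:
        reply = query(model, prompt)
        # Downstream (not shown): translate the reply to English and label it
        # safe, unsafe, or invalid, following the paper's evaluation protocol.
```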
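
For the SELF-DEFENSE framework, the paper states that the multilingual safety training data is generated and translated by the LLM itself, and that ChatGPT is then fine-tuned on the resulting 500 pairs across 10 languages for 3 epochs. The sketch below illustrates that pipeline under stated assumptions: the seed-pair file format, the translation prompt, and the helper names are hypothetical, and the fine-tuning call uses the current OpenAI fine-tuning API rather than anything specified in the paper.

```python
# Hedged sketch of the SELF-DEFENSE idea: the LLM produces (unsafe instruction,
# safe response) seed pairs in English, translates them into target languages,
# and the merged pairs are used to fine-tune ChatGPT for 3 epochs.
import json
from openai import OpenAI

client = OpenAI()
LANGUAGES = ["zh", "it", "vi", "ar", "ko", "th", "bn", "sw", "jv"]  # example target languages

def ask(instruction: str) -> str:
    """Single deterministic ChatGPT call used for both generation and translation."""
    out = client.chat.completions.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": instruction}],
        temperature=0,
    )
    return out.choices[0].message.content

def translate(text: str, lang: str) -> str:
    # Assumption: translation is performed by the same LLM, as in SELF-DEFENSE.
    return ask(f"Translate the following text into {lang}:\n{text}")

# 1) English seed pairs (hypothetical local file and schema).
with open("seed_pairs_en.json") as f:
    seed_pairs = json.load(f)  # [{"prompt": ..., "safe_response": ...}, ...]

# 2) Expand to target languages and write OpenAI fine-tuning records (chat format).
with open("self_defense_train.jsonl", "w") as f:
    for pair in seed_pairs:
        for lang in ["en"] + LANGUAGES:
            prompt = pair["prompt"] if lang == "en" else translate(pair["prompt"], lang)
            answer = pair["safe_response"] if lang == "en" else translate(pair["safe_response"], lang)
            record = {"messages": [{"role": "user", "content": prompt},
                                   {"role": "assistant", "content": answer}]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# 3) Launch a fine-tuning job (paper: 3 epochs on ~500 pairs across 10 languages).
upload = client.files.create(file=open("self_defense_train.jsonl", "rb"), purpose="fine-tune")
client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="gpt-3.5-turbo-0613",
    hyperparameters={"n_epochs": 3},
)
```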