Toxicity Detection for Free
Authors: Zhanhao Hu, Julien Piet, Geng Zhao, Jiantao Jiao, David Wagner
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we introduce Moderation Using LLM Introspection (MULI), which detects toxic prompts using the information extracted directly from LLMs themselves. We found we can distinguish between benign and toxic prompts from the distribution of the first response token's logits. Using this idea, we build a robust detector of toxic prompts using a sparse logistic regression model on the first response token logits. Our scheme outperforms SOTA detectors under multiple metrics. (Section 6: Experiments; a hedged logit-extraction sketch follows this table.) |
| Researcher Affiliation | Academia | Zhanhao Hu Julien Piet Geng Zhao Jiantao Jiao David Wagner University of California, Berkeley {huzhanhao,julien.piet,gengzhao,jiantao,daw}@berkeley.edu |
| Pseudocode | No | The paper describes the proposed method using equations and descriptive text, but it does not include a formal pseudocode or algorithm block. |
| Open Source Code | Yes | We released our code on GitHub: https://github.com/WhoTHU/detection_logits |
| Open Datasets | Yes | We used the prompts in the ToxicChat [14] and LMSYS-Chat-1M [31] datasets for evaluation, and included the OpenAI Moderation API Evaluation dataset for cross-dataset validation [17]. |
| Dataset Splits | No | The training split of ToxicChat consists of 4698 benign prompts and 384 toxic prompts, the latter including 113 jailbreaking prompts. The test split contains 4721 benign prompts and 362 toxic prompts, the latter including 91 jailbreaking prompts. The paper defines training and test splits with specific counts but does not explicitly mention or quantify a validation split. |
| Hardware Specification | No | The paper mentions running experiments and evaluating models, but it does not provide specific details on the hardware used, such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper describes the experimental setup and parameters but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions or specific library versions). |
| Experiment Setup | Yes | For llama-2-7b, we set λ = 1×10⁻³ in Equation (5) and optimized the parameters w and b for 500 epochs by Stochastic Gradient Descent with a learning rate of 5×10⁻⁴ and batch size 128. (A hedged sketch of this training setup follows this table.) |
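
To make the detection signal described in the Research Type row concrete, the sketch below shows one way first-response-token logits could be extracted with HuggingFace `transformers`. The checkpoint name, chat-template usage, and helper function are illustrative assumptions, not the paper's released implementation; see the linked repository for the authoritative code.

```python
# Hypothetical sketch: extract the logits over the vocabulary for the first
# response token of a chat LLM, given a user prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def first_token_logits(prompt: str) -> torch.Tensor:
    """Return the vocabulary logits predicting the first response token."""
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        out = model(input_ids)
    # The logits at the last input position predict the first token the
    # model would emit in its response.
    return out.logits[0, -1, :].float().cpu()
```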
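The Experiment Setup row lists the hyperparameters for the sparse logistic regression detector (λ = 1×10⁻³, 500 epochs of SGD, learning rate 5×10⁻⁴, batch size 128). Below is a minimal PyTorch sketch of such a training loop; the L1-penalized objective and the data handling are assumptions based on the paper's description, not its released code.

```python
# Hypothetical sketch: sparse (L1-regularized) logistic regression on
# first-response-token logits, trained with SGD as described in the table.
import torch
from torch import nn

def train_detector(X: torch.Tensor, y: torch.Tensor, lam: float = 1e-3,
                   lr: float = 5e-4, batch_size: int = 128, epochs: int = 500):
    """X: (N, vocab_size) first-token logits; y: (N,) labels (1 = toxic)."""
    n, d = X.shape
    linear = nn.Linear(d, 1)          # parameters w and b
    opt = torch.optim.SGD(linear.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        perm = torch.randperm(n)
        for i in range(0, n, batch_size):
            idx = perm[i:i + batch_size]
            opt.zero_grad()
            scores = linear(X[idx]).squeeze(-1)
            # Cross-entropy loss plus an L1 penalty on w to encourage sparsity.
            loss = bce(scores, y[idx].float()) + lam * linear.weight.abs().sum()
            loss.backward()
            opt.step()
    return linear
```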