Toxicity Detection for Free
Authors: Zhanhao Hu, Julien Piet, Geng Zhao, Jiantao Jiao, David Wagner
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we introduce Moderation Using LLM Introspection (MULI), which detects toxic prompts using the information extracted directly from LLMs themselves. We found we can distinguish between benign and toxic prompts from the distribution of the first response token's logits. Using this idea, we build a robust detector of toxic prompts using a sparse logistic regression model on the first response token logits. Our scheme outperforms SOTA detectors under multiple metrics. (Section 6: Experiments; a hedged logit-extraction sketch follows this table.) |
| Researcher Affiliation | Academia | Zhanhao Hu Julien Piet Geng Zhao Jiantao Jiao David Wagner University of California, Berkeley {huzhanhao,julien.piet,gengzhao,jiantao,daw}@berkeley.edu |
| Pseudocode | No | The paper describes the proposed method using equations and descriptive text, but it does not include a formal pseudocode or algorithm block. |
| Open Source Code | Yes | We released our code on GitHub: https://github.com/WhoTHU/detection_logits |
| Open Datasets | Yes | We used the prompts in the ToxicChat [14] and LMSYS-Chat-1M [31] datasets for evaluation, and included the OpenAI Moderation API Evaluation dataset for cross-dataset validation [17]. |
| Dataset Splits | No | The training split of ToxicChat consists of 4698 benign prompts and 384 toxic prompts, the latter including 113 jailbreaking prompts. The test split contains 4721 benign prompts and 362 toxic prompts, the latter including 91 jailbreaking prompts. The paper defines training and test splits with specific counts but does not explicitly mention or quantify a validation split. |
| Hardware Specification | No | The paper mentions running experiments and evaluating models, but it does not provide specific details on the hardware used, such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper describes the experimental setup and parameters but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions or specific library versions). |
| Experiment Setup | Yes | For llama-2-7b, we set λ = 1×10⁻³ in Equation (5) and optimized the parameters w and b for 500 epochs by Stochastic Gradient Descent with a learning rate of 5×10⁻⁴ and batch size 128. (A hedged sketch of this training setup follows this table.) |
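
To make the detection signal described in the Research Type row concrete, the sketch below shows one way first-response-token logits could be extracted with HuggingFace `transformers`. The checkpoint name, chat-template usage, and helper function are illustrative assumptions, not the paper's released implementation; see the linked repository for the authoritative code.

```python
# Hypothetical sketch: extract the logits over the vocabulary for the first
# response token of a chat LLM, given a user prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def first_token_logits(prompt: str) -> torch.Tensor:
    """Return the vocabulary logits predicting the first response token."""
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        out = model(input_ids)
    # The logits at the last input position predict the first token the
    # model would emit in its response.
    return out.logits[0, -1, :].float().cpu()
```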
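The Experiment Setup row lists the hyperparameters for the sparse logistic regression detector (λ = 1×10⁻³, 500 epochs of SGD, learning rate 5×10⁻⁴, batch size 128). Below is a minimal PyTorch sketch of such a training loop; the L1-penalized objective and the data handling are assumptions based on the paper's description, not its released code.

```python
# Hypothetical sketch: sparse (L1-regularized) logistic regression on
# first-response-token logits, trained with SGD as described in the table.
import torch
from torch import nn

def train_detector(X: torch.Tensor, y: torch.Tensor, lam: float = 1e-3,
                   lr: float = 5e-4, batch_size: int = 128, epochs: int = 500):
    """X: (N, vocab_size) first-token logits; y: (N,) labels (1 = toxic)."""
    n, d = X.shape
    linear = nn.Linear(d, 1)          # parameters w and b
    opt = torch.optim.SGD(linear.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        perm = torch.randperm(n)
        for i in range(0, n, batch_size):
            idx = perm[i:i + batch_size]
            opt.zero_grad()
            scores = linear(X[idx]).squeeze(-1)
            # Cross-entropy loss plus an L1 penalty on w to encourage sparsity.
            loss = bce(scores, y[idx].float()) + lam * linear.weight.abs().sum()
            loss.backward()
            opt.step()
    return linear
```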