Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models
Authors: Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, Nicholas Apostoloff, Luca Zappella, Pau Rodríguez
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that AURA can achieve up to 2.2× reduction in toxicity with only a 0.72 perplexity increase. We also show that AURA is effective with models of different scale (from 1.5B to 40B parameters), and its effectiveness in mitigating toxic language, while preserving common-sense zero-shot abilities, holds across all scales. [...] In Table 1, we report toxicity mitigation results on the Honest and RTP datasets. [...] In Table 3: Impact of AURA on zero-shot common sense reasoning benchmarks. |
| Researcher Affiliation | Collaboration | Xavier Suau*¹, Pieter Delobelle*¹², Katherine Metcalf¹, Armand Joulin¹, Nicholas Apostoloff¹, Luca Zappella¹, Pau Rodríguez¹ [...] ¹Apple, ²KU Leuven. |
| Pseudocode | Yes | A. Algorithms In this section we provide pseudo-code for the algorithms to compute neuron expertise (Algorithm 1), as well as to implement Detzero (Algorithm 2), DAMP (Algorithm 3) and AURA (Algorithm 4). |
| Open Source Code | Yes | Code available at https://github.com/apple/ml-aura |
| Open Datasets | Yes | The Jigsaw 2018 dataset (Adams et al., 2017): comments from Wikipedia, labeled as toxic or not with subcategories: severely toxic, insults, identity hate and obscene. [...] Real Toxicity Prompts (Gehman et al., 2020) or RTP is a completion benchmark that uses a classifier to detect toxicity. |
| Dataset Splits | No | The paper describes the datasets used (e.g., Jigsaw, Real Toxicity Prompts) but does not provide explicit train/validation/test split percentages, sample counts, or specific methodologies for partitioning the data into these sets for reproduction. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., GPU models, CPU types, or cloud computing instances) used for running the experiments. |
| Software Dependencies | No | To generate the completions to the prompts, we use the generate function from the Hugging Face transformers library, which automatically sets several hyperparameters (beams = 1, top-50 multinomial sampling, temperature = 1) based on the model's configuration. [...] The layer type column shows the pattern to match the layer names in the PyTorch implementation from Hugging Face. |
| Experiment Setup | Yes | To generate the completions to the prompts, we use the generate function from the Hugging Face transformers library, which automatically sets several hyperparameters (beams = 1, top-50 multinomial sampling, temperature = 1) based on the model's configuration. [...] We use this model with the control code "Wikipedia", which has a low level of toxicity. We also enforce a repetition penalty θ = 1.2, as recommended by Keskar et al. (2019), because all generations would otherwise just repeat tokens. [...] To define the toxicity concept we use 500 toxic sentences and 2000 non-toxic sentences from the Toxic category of the Jigsaw dataset. |
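The decoding settings quoted in the Experiment Setup row map directly onto keyword arguments of the Hugging Face `generate` API. The sketch below collects them explicitly rather than relying on the model's default configuration; the `completion_call` wrapper is a hypothetical helper (not from the paper or its repo), shown only to illustrate how a reproduction might invoke generation. The official code at https://github.com/apple/ml-aura is authoritative.

```python
# Decoding hyperparameters as reported in the paper, expressed as
# Hugging Face `generate` keyword arguments (names per the transformers API).
generation_kwargs = {
    "do_sample": True,          # multinomial sampling rather than greedy decoding
    "num_beams": 1,             # no beam search
    "top_k": 50,                # top-50 sampling
    "temperature": 1.0,
    "repetition_penalty": 1.2,  # as recommended by Keskar et al. (2019)
}


def completion_call(model, tokenizer, prompt: str) -> str:
    """Hypothetical wrapper: assumes a causal LM and matching tokenizer
    loaded via transformers (e.g. AutoModelForCausalLM / AutoTokenizer)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, **generation_kwargs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Passing the settings explicitly avoids a known reproducibility pitfall the report hints at: `generate` silently fills unspecified hyperparameters from each model's stored `GenerationConfig`, so defaults can differ across the 1.5B to 40B checkpoints.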
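The Pseudocode row notes that Appendix A gives algorithms for computing neuron expertise and for the AURA intervention. As a rough illustration of the idea (neurons are scored by how well their activations separate toxic from non-toxic sentences, and expert neurons are "whispered" rather than zeroed), here is a minimal pure-Python sketch. The gain rule `1 - Gini`, with Gini = 2·AUROC − 1, is an assumption about Algorithm 4 based on the paper's description; consult the appendix and the released code for the exact procedure.

```python
def auroc(toxic_acts, nontoxic_acts):
    """AUROC of one neuron as a toxicity classifier: the probability that
    a randomly drawn toxic-sentence activation exceeds a non-toxic one
    (ties count as 0.5). O(n*m) pairwise form, fine for a sketch."""
    wins = sum(
        1.0 if t > n else 0.5 if t == n else 0.0
        for t in toxic_acts
        for n in nontoxic_acts
    )
    return wins / (len(toxic_acts) * len(nontoxic_acts))


def aura_gain(a):
    """Assumed AURA dampening gain: alpha = 1 - Gini = 2 * (1 - AUROC)
    for expert neurons (AUROC > 0.5); non-experts keep gain 1.
    A perfect expert (AUROC = 1) is fully muted (gain 0)."""
    return 1.0 if a <= 0.5 else 2.0 * (1.0 - a)
```

In a real run, the activations would come from forward hooks on the matched PyTorch layers, and the gains would be folded into those layers' outputs at inference time.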