Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models

Authors: Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, Nicholas Apostoloff, Luca Zappella, Pau Rodríguez

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show that AURA can achieve up to a 2.2× reduction in toxicity with only a 0.72 perplexity increase. We also show that AURA is effective with models of different scales (from 1.5B to 40B parameters), and its effectiveness in mitigating toxic language, while preserving common-sense zero-shot abilities, holds across all scales. [...] In Table 1, we report toxicity mitigation results on the Honest and RTP datasets. [...] In Table 3: Impact of AURA on zero-shot common sense reasoning benchmarks."
Researcher Affiliation | Collaboration | "Xavier Suau* 1, Pieter Delobelle* 1 2, Katherine Metcalf 1, Armand Joulin 1, Nicholas Apostoloff 1, Luca Zappella 1, Pau Rodríguez 1 [...] 1 Apple, 2 KU Leuven."
Pseudocode | Yes | "A. Algorithms: In this section we provide pseudo-code for the algorithms to compute neuron expertise (Algorithm 1), as well as to implement Detzero (Algorithm 2), DAMP (Algorithm 3) and AURA (Algorithm 4)."
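As a rough illustration of the appendix algorithms referenced above, the sketch below computes per-neuron "expertise" as the AUROC of a neuron's activation when used as a toxicity classifier, then derives AURA-style dampening gains. The AUROC-based expertise score follows the paper's description; the specific linear gain 2·(1 − AUROC) applied to experts is an assumption of this sketch, not a transcription of Algorithm 4.

```python
import numpy as np

def auroc(pos, neg):
    """AUROC of one neuron's activations as a toxicity classifier: the
    probability that an activation on a toxic sentence exceeds one on a
    non-toxic sentence (with half-credit for ties)."""
    pos = np.asarray(pos, dtype=float)[:, None]
    neg = np.asarray(neg, dtype=float)[None, :]
    return float(np.mean(pos > neg) + 0.5 * np.mean(pos == neg))

def aura_gains(toxic_acts, nontoxic_acts):
    """Per-neuron dampening gains in [0, 1].

    toxic_acts:    (n_toxic, n_neurons) activations on toxic sentences
    nontoxic_acts: (n_nontoxic, n_neurons) activations on non-toxic ones

    Neurons with AUROC <= 0.5 are not toxicity experts and keep gain 1;
    experts are dampened toward 0 as their AUROC approaches 1.
    """
    gains = np.ones(toxic_acts.shape[1])
    for m in range(toxic_acts.shape[1]):
        a = auroc(toxic_acts[:, m], nontoxic_acts[:, m])
        if a > 0.5:
            # Assumed linear schedule: perfect experts (AUROC = 1) are
            # silenced, borderline ones (AUROC ~ 0.5) barely touched.
            gains[m] = 2.0 * (1.0 - a)
    return gains
```

At inference time the gains would multiply the corresponding neuron outputs; Detzero and DAMP from the same appendix would correspond to replacing the expertise-dependent gain on experts with zero or a fixed constant, respectively.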
Open Source Code | Yes | Code is available at https://github.com/apple/ml-aura
Open Datasets | Yes | "The Jigsaw 2018 dataset (Adams et al., 2017): comments from Wikipedia, labeled as toxic or not, with subcategories: severely toxic, insults, identity hate and obscene. [...] Real Toxicity Prompts (Gehman et al., 2020), or RTP, is a completion benchmark that uses a classifier to detect toxicity."
Dataset Splits | No | The paper describes the datasets used (e.g., Jigsaw, Real Toxicity Prompts) but does not give explicit train/validation/test split percentages, sample counts, or a partitioning methodology that would allow the splits to be reproduced.
Hardware Specification | No | The paper does not specify the hardware used to run the experiments (e.g., GPU models, CPU types, or cloud computing instances).
Software Dependencies | No | "To generate the completions to the prompts, we use the generate function from the Hugging Face transformers library, which automatically sets several hyperparameters (beams = 1, top-50 multinomial sampling, temperature = 1) based on the model's configuration. [...] The layer type column shows the pattern to match the layer names in the PyTorch implementation from Hugging Face."
Experiment Setup | Yes | "To generate the completions to the prompts, we use the generate function from the Hugging Face transformers library, which automatically sets several hyperparameters (beams = 1, top-50 multinomial sampling, temperature = 1) based on the model's configuration. [...] We use this model with the control code 'Wikipedia', which has a low level of toxicity. We also enforce a repetition penalty θ = 1.2, as recommended by Keskar et al. (2019), because all generations would otherwise just repeat tokens. [...] To define the toxicity concept we use 500 toxic sentences and 2000 non-toxic sentences from the Toxic category of the Jigsaw dataset."
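The quoted decoding settings map onto the transformers `generate` API roughly as follows. This is a sketch: the checkpoint name is a placeholder (the paper evaluates models from 1.5B to 40B parameters), the completion length is an assumption, and transformers is imported lazily so the settings themselves can be inspected without the dependency installed.

```python
# Decoding settings quoted above: single beam, top-50 multinomial sampling,
# temperature 1, plus the repetition penalty of 1.2 recommended by
# Keskar et al. (2019).
GEN_KWARGS = dict(
    num_beams=1,
    do_sample=True,
    top_k=50,
    temperature=1.0,
    repetition_penalty=1.2,
    max_new_tokens=32,  # completion length is an assumption of this sketch
)

def complete(prompt, model_name="gpt2"):  # placeholder checkpoint
    # Imported here so GEN_KWARGS stays inspectable without the (heavy)
    # transformers dependency installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, **GEN_KWARGS)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```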