Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models
Authors: Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, Nicholas Apostoloff, Luca Zappella, Pau Rodriguez
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that AURA can achieve up to 2.2× reduction in toxicity with only a 0.72 perplexity increase. We also show that AURA is effective with models of different scale (from 1.5B to 40B parameters), and its effectiveness in mitigating toxic language, while preserving common-sense zero-shot abilities, holds across all scales. [...] In Table 1, we report toxicity mitigation results on the Honest and RTP datasets. [...] In Table 3: Impact of AURA on zero-shot common sense reasoning benchmarks. |
| Researcher Affiliation | Collaboration | Xavier Suau* 1 Pieter Delobelle* 1 2 Katherine Metcalf 1 Armand Joulin 1 Nicholas Apostoloff 1 Luca Zappella 1 Pau Rodríguez 1 [...] 1Apple 2KU Leuven. |
| Pseudocode | Yes | A. Algorithms In this section we provide pseudo-code for the algorithms to compute neuron expertise (Algorithm 1), as well as to implement Detzero (Algorithm 2), DAMP (Algorithm 3) and AURA (Algorithm 4). |
| Open Source Code | Yes | Code available at https://github.com/apple/ml-aura |
| Open Datasets | Yes | The Jigsaw 2018 dataset (Adams et al., 2017): comments from Wikipedia, labeled as toxic or not with subcategories: severely toxic, insults, identity hate and obscene. [...] Real Toxicity Prompts (Gehman et al., 2020) or RTP is a completion benchmark that uses a classifier to detect toxicity. |
| Dataset Splits | No | The paper describes the datasets used (e.g., Jigsaw, Real Toxicity Prompts) but does not provide explicit train/validation/test split percentages, sample counts, or specific methodologies for partitioning the data into these sets for reproduction. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., GPU models, CPU types, or cloud computing instances) used for running the experiments. |
| Software Dependencies | No | To generate the completions to the prompts, we use the generate function from the Hugging Face transformers library, which automatically sets several hyperparameters (beams = 1, top-50 multinomial sampling, temperature = 1) based on the model's configuration. [...] The layer type column shows the pattern to match the layer names in the Pytorch implementation from Huggingface. |
| Experiment Setup | Yes | To generate the completions to the prompts, we use the generate function from the Hugging Face transformers library, which automatically sets several hyperparameters (beams = 1, top-50 multinomial sampling, temperature = 1) based on the model's configuration. [...] We use this model with the control code Wikipedia, which has a low level of toxicity. We also enforce a repetition penalty θ = 1.2, as recommended by Keskar et al. (2019) because all generations would just repeat tokens otherwise. [...] To define the toxicity concept we use 500 toxic sentences and 2000 non-toxic sentences from the Toxic category of the Jigsaw dataset. |
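
The decoding settings quoted in the Experiment Setup row can be written out explicitly rather than left to a model's default configuration. The following is a minimal sketch, not the authors' code: the helper names `build_generation_kwargs` and `complete`, the `max_new_tokens` value, and the choice of `AutoModelForCausalLM` are assumptions made for illustration. Only the values beams = 1, top-50 sampling, temperature = 1 and the optional repetition penalty of 1.2 come from the row above.

```python
from typing import Any, Dict, Optional

# Decoding settings quoted in the Experiment Setup row, expressed as
# keyword arguments of the transformers `generate` method.
BASE_GENERATION_KWARGS: Dict[str, Any] = {
    "do_sample": True,    # multinomial sampling
    "num_beams": 1,       # beams = 1
    "top_k": 50,          # top-50 sampling
    "temperature": 1.0,   # temperature = 1
}


def build_generation_kwargs(
    max_new_tokens: int = 20,                   # assumption: length is not stated in the row
    repetition_penalty: Optional[float] = None  # the row quotes 1.2 for one model only
) -> Dict[str, Any]:
    """Return a copy of the base settings with optional per-model overrides."""
    kwargs = dict(BASE_GENERATION_KWARGS)
    kwargs["max_new_tokens"] = max_new_tokens
    if repetition_penalty is not None:
        kwargs["repetition_penalty"] = repetition_penalty
    return kwargs


def complete(prompt: str, model_name: str, **overrides: Any) -> str:
    """Hypothetical helper: sample one completion for `prompt`."""
    # Imported lazily so the settings can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, **build_generation_kwargs(**overrides))
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `build_generation_kwargs(repetition_penalty=1.2)` yields the settings for the model that the row says needs a repetition penalty; for other models the argument can be omitted. Pinning these values in code, together with the installed library versions, would address part of what the Software Dependencies row flags as missing.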