Leashing the Inner Demons: Self-Detoxification for Language Models

Authors: Canwen Xu, Zexue He, Zhankui He, Julian McAuley

AAAI 2022, pp. 11530-11537

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "In this paper, we conduct extensive experiments to study this phenomenon. We analyze the impact of prompts, decoding strategies and training corpora on the output toxicity." |
| Researcher Affiliation | Academia | University of California, San Diego ({cxu,zehe,zhh004,jmcauley}@ucsd.edu) |
| Pseudocode | No | The paper describes the methodology in prose and with a workflow diagram (Figure 1), but contains no structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper notes that its implementation is based on Hugging Face Transformers, but neither links to nor explicitly addresses the availability of its own source code. |
| Open Datasets | Yes | "We sample 5,000 prompts from Writing Prompts (Fan, Lewis, and Dauphin 2018). ... we use 5,000 prompts associated with the highest toxicity from Real Toxic Prompts (Gehman et al. 2020)." A sampling sketch follows the table. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test splits. It mentions using 5,000 prompts for generation and then fine-tuning on the generated text, but states no validation split for either the fine-tuning or the evaluation. |
| Hardware Specification | Yes | "We generate text on an Nvidia V100, requiring around 12h to generate 5,000 samples." |
| Software Dependencies | No | The paper states "Our implementation is based on Hugging Face Transformers (Wolf et al. 2020)" but provides no version numbers for this or any other software dependency. |
| Experiment Setup | Yes | "The maximum generation length is set to 200. The temperature is set to 1 for top-k, top-p, and beam search." See the decoding sketch after the table. |
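Since the Experiment Setup row pins down only the generation length and the temperature, the following is a minimal sketch of the three quoted decoding strategies in Hugging Face Transformers. The k, p, and beam-width values, and the `gpt2` checkpoint, are illustrative placeholders, not settings reported in the paper.

```python
# Sketch of the quoted decoding setup: max generation length 200,
# temperature 1 for top-k, top-p, and beam search. Placeholder values
# (k=50, p=0.9, 5 beams, the "gpt2" checkpoint) are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Once upon a time", return_tensors="pt")
sample_args = dict(
    max_new_tokens=200,  # "The maximum generation length is set to 200."
    temperature=1.0,     # "The temperature is set to 1 ..."
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

out_topk = model.generate(**inputs, top_k=50, **sample_args)             # top-k sampling
out_topp = model.generate(**inputs, top_p=0.9, top_k=0, **sample_args)   # top-p (nucleus) sampling
out_beam = model.generate(                                               # beam search (deterministic,
    **inputs, num_beams=5, do_sample=False,                              # so temperature has no effect)
    max_new_tokens=200, pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(out_topk[0], skip_special_tokens=True))
```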
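Likewise, the Open Datasets row names two prompt pools, but the paper's own sampling code is not released. The sketch below shows one way to reproduce the selection, assuming the `allenai/real-toxicity-prompts` mirror on the Hugging Face Hub and a local copy of the original Writing Prompts release; the file path and random seed are illustrative, as the paper states neither.

```python
import random
from datasets import load_dataset

# 5,000 highest-toxicity prompts from RealToxicityPrompts (Gehman et al. 2020).
# The Hub mirror "allenai/real-toxicity-prompts" is an assumption; some prompts
# lack a toxicity score (None), hence the fallback to 0.0.
rtp = load_dataset("allenai/real-toxicity-prompts", split="train")
rtp_sorted = sorted(rtp, key=lambda r: r["prompt"]["toxicity"] or 0.0, reverse=True)
toxic_prompts = [r["prompt"]["text"] for r in rtp_sorted[:5000]]

# 5,000 prompts sampled from Writing Prompts (Fan, Lewis, and Dauphin 2018).
# "train.wp_source" mirrors the file layout of the original release; the
# path and the seed are illustrative.
random.seed(0)
with open("writingPrompts/train.wp_source", encoding="utf-8") as f:
    wp = [line.strip() for line in f if line.strip()]
wp_prompts = random.sample(wp, 5000)
```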