Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].
IF-Guide: Influence Function-Guided Detoxification of LLMs
Authors: Zachary Coalson, Juhan Bae, Nicholas Carlini, Sanghyun Hong
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our evaluation, we demonstrate that IF-GUIDE substantially reduces both explicit and implicit toxicity by up to 10× compared to uncensored models, and up to 3× compared to baseline alignment methods such as DPO and RAD across both pre-training and fine-tuning scenarios. (Section 4: Evaluation) |
| Researcher Affiliation | Collaboration | Zachary Coalson¹, Juhan Bae², Nicholas Carlini³, Sanghyun Hong¹; ¹Oregon State University, ²University of Toronto, ³Anthropic |
| Pseudocode | Yes | Algorithm 1: Toxic Token Selection. 1: Require: training data $\{x_1, \dots, x_N\}$, influence scores $\{S_{ij}\}$, toxicity threshold $\tau_{\text{tox}}$, window size $w$, token limit $L$. 2: // Rank documents by toxicity. 3: For $i = 1$ to $N$: 4: Compute sparsity: $s_i \leftarrow \sum_j \mathbf{1}\{S_{ij} > \tau_{\text{tox}}\}$. 5: Compute score: $f_i \leftarrow \sum_j S_{ij}\,\mathbf{1}\{S_{ij} > \tau_{\text{tox}}\}$. 6: Min-max normalize $\{s_i\}_{i=1}^{N}$ and $\{f_i\}_{i=1}^{N}$. 7: For $i = 1$ to $N$: 8: Compute rank: $R_i \leftarrow \frac{2 s_i f_i}{s_i + f_i}$. 9: // Construct toxic token sets. 10: Initialize $T_i \leftarrow \emptyset$ for all $i$; total selected $C \leftarrow 0$. 11: For each $i$ in argsort($\{R_i\}$) descending: 12: For each $j$ with $S_{ij} > \tau_{\text{tox}}$: 13: // Add $w$ tokens of context for each toxic token. 14: For $k = \max(1, j - w)$ to $\min(\lvert x_i \rvert, j + w)$: 15: If $k \notin T_i$: 16: Add $k$ to $T_i$; $C \leftarrow C + 1$. 17: If $C \geq L$: 18: Return toxic token sets $\{T_i\}_{i=1}^{N}$. 19: Return toxic token sets $\{T_i\}_{i=1}^{N}$. |
| Open Source Code | Yes | Our code is publicly available at: https://github.com/ztcoalson/IF-Guide. |
| Open Datasets | Yes | We train each model on a randomly sampled one billion-token subset of OpenWebText [22], a large corpus that fits within our academic compute budget. We evaluate IF-GUIDE's effectiveness on RealToxicityPrompts (RTP) [20], a benchmark designed to measure a model's propensity to generate toxic content. Following recent work [34], we also consider BOLD [15], which focuses on demographic biases, and AttaQ [37], which contains adversarial questions designed to induce unsafe generations. We also evaluate accuracy (Acc.) on the last-token prediction task from LAMBADA [60], which measures a model's ability to understand long-range dependencies in narrative passages. |
| Dataset Splits | Yes | We train each model on a randomly sampled one billion-token subset of OpenWebText [22], a large corpus that fits within our academic compute budget. We train all models for four epochs, which prior work has found offers the best compute-performance trade-off at this scale [55]. We evaluate performance on the training distribution by reporting perplexity (PPL) on a test set of 10 million tokens from OpenWebText. |
| Hardware Specification | Yes | We run all experiments on two machines: the first has an Intel Xeon Processor with 48 cores, 768GB of memory, and 8 Nvidia A40 GPUs. The second has an Intel Xeon Processor with 112 cores, 2TB of memory, and 8 Nvidia H100 GPUs. |
| Software Dependencies | Yes | We implement IF-GUIDE using Python v3.10.16 and PyTorch v2.5.1, which supports CUDA 11.8 for GPU usage. |
| Experiment Setup | Yes | Table 3 shows the exact hyperparameters we use for pre-training and fine-tuning. For pre-training with IF-GUIDE, we minimize our proposed loss objective (Eq. 9); otherwise, we use the standard cross-entropy loss. All training runs use the AdamW optimizer [51]. |
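
The Pseudocode row quotes Algorithm 1 (Toxic Token Selection). As a reading aid, the following is a minimal Python sketch of those steps, not the authors' implementation (their code is at the repository listed in the Open Source Code row). The function names `min_max` and `select_toxic_tokens` are hypothetical, token positions are 0-based rather than the 1-based indices in the pseudocode, the rank is assumed to be 0 when both normalized terms are 0, and the limit check is assumed to run after every added token.

```python
from typing import List, Set


def min_max(values: List[float]) -> List[float]:
    """Min-max normalize to [0, 1]; a constant list maps to all zeros (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_toxic_tokens(
    scores: List[List[float]],  # scores[i][j] plays the role of S_ij
    tau_tox: float,             # toxicity threshold
    window: int,                # context window size w
    token_limit: int,           # token limit L
) -> List[Set[int]]:
    """Return, for each document, the set of selected 0-based token positions."""
    if not scores:
        return []

    # Steps 3-5: count and sum of above-threshold influence scores per document.
    sparsity = [float(sum(1 for s in doc if s > tau_tox)) for doc in scores]
    total = [sum(s for s in doc if s > tau_tox) for doc in scores]

    # Step 6: min-max normalize both quantities across documents.
    sparsity_n, total_n = min_max(sparsity), min_max(total)

    # Steps 7-8: harmonic mean of the two normalized terms.
    # Assumption: the rank is 0 when both terms are 0 (the pseudocode leaves this open).
    rank = [
        2 * s * f / (s + f) if (s + f) > 0 else 0.0
        for s, f in zip(sparsity_n, total_n)
    ]

    # Step 10: empty token sets and a running count of selected tokens.
    selected: List[Set[int]] = [set() for _ in scores]
    count = 0

    # Step 11: visit documents from highest to lowest rank.
    order = sorted(range(len(scores)), key=lambda i: rank[i], reverse=True)
    for i in order:
        n = len(scores[i])
        for j, s in enumerate(scores[i]):
            if s <= tau_tox:
                continue
            # Steps 13-16: add the toxic token plus `window` tokens on each side.
            for k in range(max(0, j - window), min(n - 1, j + window) + 1):
                if k not in selected[i]:
                    selected[i].add(k)
                    count += 1
                    # Steps 17-18: stop once the token limit is reached.
                    if count >= token_limit:
                        return selected
    return selected
```

Calling `select_toxic_tokens(scores, tau_tox, window, token_limit)` with a small `token_limit` stops partway through the highest-ranked document, which is the early-return branch of the pseudocode.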