Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing
Authors: Yisong Xiao, Aishan Liu, Siyuan Liang, Zonghao Ying, Xianglong Liu, Dacheng Tao
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across 8 widely used LLMs show that ARGRE significantly outperforms leading baselines in effectiveness (-62.21% toxicity) and efficiency (-47.58% inference time), while preserving the core capabilities of the original model with minimal degradation. |
| Researcher Affiliation | Academia | Yisong Xiao¹, Aishan Liu¹, Siyuan Liang², Zonghao Ying¹, Xianglong Liu¹˒³˒⁴, Dacheng Tao⁵. ¹SKLCCSE, Beihang University; ²National University of Singapore; ³Zhongguancun Laboratory, Beijing; ⁴Institute of Dataspace, Hefei; ⁵Nanyang Technological University |
| Pseudocode | No | The paper describes its methodology using textual explanations and mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available on the website. The paper provides open access to the data and code, the anonymous link is https://anonymous.4open.science/r/ARGRE-6291. |
| Open Datasets | Yes | For toxicity annotations, we adopt the pairwise toxic dataset from [34], where non-toxic sequences are sampled from Wikitext2 [67], and toxic counterparts are generated using PPLM [51]. We adopt the challenge subset of RealToxicityPrompts [24], which contains 1,199 prompts known to elicit highly toxic continuations from language models. |
| Dataset Splits | Yes | For toxicity annotations, we adopt the pairwise toxic dataset from [34], where non-toxic sequences are sampled from Wikitext2 [67], and toxic counterparts are generated using PPLM [51]. We first measure the model's perplexity on the WikiText-2 [67] development split, which contains 2,064 samples. We perform 2-fold cross-validation on the 654 samples using Mistral 7B. We use the 128 pairwise benign-harmful annotations from [85] as training data. |
| Hardware Specification | Yes | Experiments are conducted on a server with Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz, 512GB system memory, and six NVIDIA A100 GPUs with 40GB memory. |
| Software Dependencies | No | These models are accessed via the Hugging Face library, with access details summarized in Tab. 11. We utilize the official codebase of ProFS [33]. The paper mentions various libraries and codebases used (e.g., the Hugging Face library, the official codebases for ProFS, Re-Control, and GenARM, and the DPO implementation from [34]), but it does not specify concrete version numbers for any of these software dependencies. |
| Experiment Setup | Yes | Our auto-regressive reward model is implemented using a two-layer MLP with a hidden size of 1024. We train the model for three epochs with a learning rate of $5 \times 10^{-4}$ and $\beta_r = 0.05$, and set $\beta = 1$ during inference. In our main experiments, we consistently use the following hyperparameters: the number of interpolated trajectories $N_{in}$ is set to 7, and gradient-based optimization is performed for 5 iterations with a step size of $\eta = 0.5$. |
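
The Experiment Setup row lists concrete hyperparameters, so a minimal, dependency-free sketch may help show how they fit together. It is not the authors' implementation. The names `SetupConfig`, `RewardMLP`, and `edit_representation` are hypothetical, and the ReLU activation, the scalar output head, the random initialization, and the choice to run gradient *ascent* on the reward with respect to the representation are all assumptions; the table only states the MLP depth and width, the iteration count, and the step size.

```python
import random
from dataclasses import dataclass
from typing import List

Vector = List[float]
Matrix = List[List[float]]


@dataclass(frozen=True)
class SetupConfig:
    """Hyperparameters quoted in the Experiment Setup row."""
    hidden_size: int = 1024       # hidden width of the two-layer MLP
    epochs: int = 3               # reward-model training epochs
    learning_rate: float = 5e-4   # reward-model learning rate
    beta_r: float = 0.05          # beta_r used during training
    beta: float = 1.0             # beta used during inference
    n_interpolated: int = 7       # number of interpolated trajectories N_in
    edit_iterations: int = 5      # gradient-based optimization iterations
    step_size: float = 0.5        # eta


@dataclass
class RewardMLP:
    """Two-layer MLP mapping a representation to a scalar reward.

    ASSUMPTION: ReLU activation and a scalar output head (not stated in the table).
    """
    w1: Matrix   # shape (hidden_size, input_dim)
    b1: Vector   # shape (hidden_size,)
    w2: Vector   # shape (hidden_size,)
    b2: float

    @classmethod
    def random_init(cls, input_dim: int, hidden_size: int, seed: int = 0) -> "RewardMLP":
        # ASSUMPTION: Gaussian initialization, chosen only for illustration.
        rng = random.Random(seed)
        w1 = [[rng.gauss(0.0, input_dim ** -0.5) for _ in range(input_dim)]
              for _ in range(hidden_size)]
        w2 = [rng.gauss(0.0, hidden_size ** -0.5) for _ in range(hidden_size)]
        return cls(w1, [0.0] * hidden_size, w2, 0.0)

    def _pre_activations(self, h: Vector) -> Vector:
        return [sum(w * x for w, x in zip(row, h)) + b
                for row, b in zip(self.w1, self.b1)]

    def reward(self, h: Vector) -> float:
        z = self._pre_activations(h)
        return sum(w * max(0.0, zj) for w, zj in zip(self.w2, z)) + self.b2

    def grad(self, h: Vector) -> Vector:
        """Gradient of the reward with respect to the input representation."""
        z = self._pre_activations(h)
        g = [0.0] * len(h)
        for row, w, zj in zip(self.w1, self.w2, z):
            if zj > 0.0:  # ReLU passes gradient only through active units
                for i, wij in enumerate(row):
                    g[i] += w * wij
        return g


def edit_representation(model: RewardMLP, h: Vector, cfg: SetupConfig) -> Vector:
    """Take cfg.edit_iterations gradient-ascent steps of size cfg.step_size."""
    h = list(h)  # copy so the caller's vector is not modified
    for _ in range(cfg.edit_iterations):
        g = model.grad(h)
        h = [x + cfg.step_size * gi for x, gi in zip(h, g)]
    return h
```

As a usage example, `edit_representation(RewardMLP.random_init(input_dim=8, hidden_size=1024), [0.1] * 8, SetupConfig())` returns an edited 8-dimensional vector; the input dimension of 8 is arbitrary here, whereas in the paper's setting it would be the hidden size of the language model being edited. The training-side values (`epochs`, `learning_rate`, `beta_r`) and `n_interpolated` are recorded in the config but not exercised by this sketch.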