Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Redefining Experts: Interpretable Decomposition of Language Models for Toxicity Mitigation
Authors: Zuhair Hasan Shaik, Abdullah Mazhar, Aseem Srivastava, Md Shad Akhtar
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on Jigsaw and ToxiCN datasets, we show that aggregated layer-wise features provide more robust signals than single neurons. Moreover, we observe conceptual limitations in prior works that conflate toxicity detection experts and generation experts within neuron-based interventions. |
| Researcher Affiliation | Academia | Zuhair Hasan Shaik (MBZUAI, UAE); Abdullah Mazhar (IIIT Delhi, India); Aseem Srivastava (MBZUAI, UAE); Md. Shad Akhtar (IIIT Delhi, India) |
| Pseudocode | Yes | Algorithm 1: Expert Finding Using AP Score |
| Open Source Code | Yes | Our contributions are summarized as follows: Code Repository: https://github.com/flamenlp/EigenShift |
| Open Datasets | Yes | In our work, we conduct a comprehensive analysis using two toxicity detection datasets: Jigsaw (English) [16] and ToxiCN (Chinese) [19]. [16] Jigsaw and Conversation AI. Toxic comment classification challenge. https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge, 2018. Accessed: 2025-05-13. |
| Dataset Splits | No | The paper mentions using specific datasets and sampling strategies (e.g., 'stratified and sampled 6,090 toxic examples' for Jigsaw) and refers to the 'standard Real Toxic Prompts dataset' and a 'fixed snapshot of the English Wikipedia corpus'. However, it does not explicitly provide the train/test/validation splits (e.g., as percentages or counts) for the datasets used in its own experiments. |
| Hardware Specification | Yes | For the Jigsaw dataset, we use three NVIDIA Tesla V100 GPUs, each with 32GB of VRAM, totaling 96GB of GPU memory. For the ToxiCN dataset, which is comparatively smaller, we use an NVIDIA A6000 GPU with 40GB of VRAM. |
| Software Dependencies | No | The paper lists various language models used (e.g., BERT, BART, Llama-3.1, Mistral, GLM-4) but does not provide specific version numbers for underlying software frameworks or libraries (e.g., PyTorch version, Python version) that would be needed for replication. |
| Experiment Setup | Yes | We conduct experiments to investigate how varying the parameters α and Top_k impacts both toxicity reduction and perplexity, and the TPH score. We provide additional findings in Appendix Section C.5 (c.f. Table 6 and Figure 4). Table 6: Toxicity and Perplexity results across various α and top_k values. |