Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Redefining Experts: Interpretable Decomposition of Language Models for Toxicity Mitigation

Authors: Zuhair Hasan Shaik, Abdullah Mazhar, Aseem Srivastava, Md Shad Akhtar

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through extensive experiments on Jigsaw and ToxiCN datasets, we show that aggregated layer-wise features provide more robust signals than single neurons. Moreover, we observe conceptual limitations in prior works that conflate toxicity detection experts and generation experts within neuron-based interventions.
Researcher Affiliation	Academia	Zuhair Hasan Shaik MBZUAI, UAE EMAIL Abdullah Mazhar IIIT Delhi, India EMAIL Aseem Srivastava MBZUAI, UAE EMAIL Md. Shad Akhtar IIIT Delhi, India EMAIL
Pseudocode	Yes	Algorithm 1: Expert Finding Using AP Score
Open Source Code	Yes	Our contributions are summarized as follows: (footnote 2) Code Repository: https://github.com/flamenlp/EigenShift
Open Datasets	Yes	In our work, we conduct a comprehensive analysis using two toxicity detection datasets: Jigsaw (English) [16] and ToxiCN (Chinese) [19]. [16] Jigsaw and Conversation AI. Toxic comment classification challenge. https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge, 2018. Accessed: 2025-05-13.
Dataset Splits	No	The paper mentions using specific datasets and sampling strategies (e.g., 'stratified and sampled 6,090 toxic examples' for Jigsaw) and refers to the 'standard Real Toxic Prompts dataset' and a 'fixed snapshot of the English Wikipedia corpus'. However, it does not explicitly provide the train/test/validation splits (e.g., as percentages or counts) for the datasets used in its own experiments.
Hardware Specification	Yes	For the Jigsaw dataset, we use three NVIDIA Tesla V100 GPUs, each with 32GB of VRAM, totaling 96GB of GPU memory. For the ToxiCN dataset, which is comparatively smaller, we use an NVIDIA A6000 GPU with 40GB of VRAM.
Software Dependencies	No	The paper lists various language models used (e.g., BERT, BART, Llama-3.1, Mistral, GLM-4) but does not provide specific version numbers for underlying software frameworks or libraries (e.g., PyTorch version, Python version) that would be needed for replication.
Experiment Setup	Yes	We conduct experiments to investigate how varying the parameters α and Top_k impacts both toxicity reduction and perplexity, and the TPH score. We provide additional findings in Appendix Section C.5 (c.f. Table 6 and Figure 4). Table 6: Toxicity and Perplexity results across various α and top_k values.