Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Discovering Salient Neurons in Deep NLP Models

Authors: Nadir Durrani, Fahim Dalvi, Hassan Sajjad

Venue: JMLR 2023

Reproducibility assessment: each entry below gives the variable, the assessed result, and the supporting LLM response excerpt from the paper.
Research Type: Experimental
LLM Response: "We carry out a fine-grained analysis to answer the following questions: (i) can we identify subsets of neurons in the network that learn a specific linguistic property? (ii) is a certain linguistic phenomenon in a given model localized (encoded in few individual neurons) or distributed across many neurons? (iii) how redundantly is the information preserved? (iv) how does fine-tuning pre-trained models towards downstream NLP tasks impact the learned linguistic knowledge? (v) how do models vary in learning different linguistic properties? Our data-driven, quantitative analysis illuminates interesting findings: (i) we found small subsets of neurons that can predict different linguistic tasks; ... We evaluate our method using i) ablation study, ii) classifier retraining, iii) selectivity, and iv) qualitative evaluation."
Researcher Affiliation: Academia
LLM Response: "Nadir Durrani, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar; Fahim Dalvi, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar; Hassan Sajjad, Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada"
Pseudocode: Yes
LLM Response:
- Algorithm 1, Probe Training (Section 2.1): function TrainProbe(X, y, λ1, λ2) ...
- Algorithm 2, Grid Search (Section 2.2): function GridSearch(X, y, α, β, M) ...
- Algorithm 3, Neuron Ranking Extraction (Section 2.3): function GetNeuronRanking(probe) ...
- Algorithm 4, Minimal Neuron Subset (Section 2.4): function GetMinimumNeurons(X, y, δ)
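The core of Algorithms 1 and 3 can be sketched in plain NumPy: a softmax probe trained with an elastic-net penalty (λ1 for the L1 term, λ2 for the L2 term), followed by ranking neurons by the magnitude of their learned weights. This is an illustrative reconstruction, not the authors' code; the full-batch gradient descent, the default hyperparameters, and the max-over-classes ranking criterion are assumptions made for the sketch.

```python
import numpy as np

def train_probe(X, y, lam1=1e-5, lam2=1e-5, lr=0.1, epochs=300):
    """Softmax probe with elastic-net regularization:
    loss = cross-entropy + lam1*||W||_1 + lam2*||W||_2^2,
    minimized by full-batch gradient descent."""
    n, d = X.shape
    k = int(y.max()) + 1
    W, b = np.zeros((d, k)), np.zeros(k)
    Y = np.eye(k)[y]                                   # one-hot labels
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)              # softmax probabilities
        grad_W = X.T @ (P - Y) / n
        W -= lr * (grad_W + lam1 * np.sign(W) + 2.0 * lam2 * W)
        b -= lr * (P - Y).mean(axis=0)
    return W, b

def neuron_ranking(W):
    """Rank neurons (input features) by their maximum absolute
    probe weight across classes, most salient first."""
    return np.argsort(-np.abs(W).max(axis=1))
```

On synthetic data where a single neuron determines the label, the elastic-net penalty concentrates weight mass on that neuron, so it surfaces at the top of the ranking.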
Open Source Code: Yes
LLM Response: "Our code is publicly available as part of the NeuroX toolkit (Dalvi et al., 2023): https://github.com/fdalvi/NeuroX"
Open Datasets: Yes
LLM Response: "Language Tasks: We evaluated our method on 6 linguistic tasks: suffix prediction, parts-of-speech tagging using the Penn Treebank (Marcus et al., 1993), syntactic chunking using the CoNLL 2000 shared task dataset (Tjong Kim Sang and Buchholz, 2000), CCG super-tagging using CCGBank (Hockenmaier, 2006), syntactic dependency labeling with the Universal Dependencies dataset, and semantic tagging using the Parallel Meaning Bank data (Abzianidze et al., 2017). ... Specifically, we experimented with SST-2 for sentiment analysis with the Stanford Sentiment Treebank (Socher et al., 2013), MNLI for natural language inference (Williams et al., 2018), QNLI for Question NLI (Rajpurkar et al., 2016), RTE for recognizing textual entailment (Bentivogli et al., 2009), MRPC for the Microsoft Research Paraphrase Corpus (Dolan and Brockett, 2005), and STS-B for the Semantic Textual Similarity Benchmark (Cer et al., 2017)."
Dataset Splits: Yes
LLM Response: "We used standard splits for training, development and test data (See Table 8 in Appendix). For multilingual experiments, we annotated a small portion of multi-parallel news data (Bojar et al., 2014) for English, German and French using RDRPOSTagger (Nguyen et al., 2014) (See Table 9 in Appendix for statistics)."

Table 8: Data statistics (number of sentences) for the training, development and test sets used in the experiments, and the number of tags to be predicted.

  Task      Train   Dev    Test    Tags
  Suffix    40000   5000   5000    58
  POS       36557   1802   1963    44
  SEM       36928   5301   10600   73
  Chunking  8881    1843   2011    22
  CCG       39101   1908   2404    1272

Table 9: Data statistics (number of sentences) for the training, development and test sets used in the multilingual experiments, and the number of POS tags and syntactic dependency relations to be predicted.

  Task                       Train   Dev    Test   Tags
  POS (en)                   14498   3000   8172   44
  POS (de)                   14498   3000   8172   52
  POS (fr)                   11495   3000   3003   13
  Syntactic Dependency (en)  11663   1914   3828   49
  Syntactic Dependency (de)  14118   1775   1776   35
  Syntactic Dependency (fr)  14552   1895   1894   40
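For anyone re-running the monolingual experiments, the Table 8 statistics can be transcribed into a small mapping and used as a sanity check that rebuilt splits match the paper. The dictionary below copies the values verbatim from the excerpt above; the helper function is a hypothetical convenience, not part of the authors' toolkit.

```python
# Split sizes (number of sentences) and tag counts from Table 8 of the paper.
TABLE8_SPLITS = {
    "Suffix":   {"train": 40000, "dev": 5000, "test": 5000,  "tags": 58},
    "POS":      {"train": 36557, "dev": 1802, "test": 1963,  "tags": 44},
    "SEM":      {"train": 36928, "dev": 5301, "test": 10600, "tags": 73},
    "Chunking": {"train": 8881,  "dev": 1843, "test": 2011,  "tags": 22},
    "CCG":      {"train": 39101, "dev": 1908, "test": 2404,  "tags": 1272},
}

def check_split_sizes(task, n_train, n_dev, n_test):
    """Return True if locally loaded split sizes match the paper's Table 8."""
    ref = TABLE8_SPLITS[task]
    return (n_train, n_dev, n_test) == (ref["train"], ref["dev"], ref["test"])
```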
Hardware Specification: No
LLM Response: "The paper does not explicitly mention the specific hardware (e.g., GPU/CPU models, RAM) used to run the experiments. It only details the transformer models used and the classifier settings."
Software Dependencies: No
LLM Response: "The paper mentions using the Adam optimizer and RDRPOSTagger for annotation, but does not specify version numbers for any key software libraries or dependencies (e.g., Python, PyTorch, TensorFlow, scikit-learn) required to reproduce the methodology."
Experiment Setup: Yes
LLM Response: "Algorithm 1 Probe Training (Section 2.1), step 2: Initialize learning rate η = 0.001, number of epochs N = 10 ... The training process involved shuffled mini-batches of size 512 and was stopped after 10 epochs. The regularization weights were trained using a grid-search algorithm. ... We set M = 20% and α, β = 0.5."
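The reported settings (η = 0.001, 10 epochs, shuffled mini-batches of size 512) can be mirrored in a minimal training loop. This is a hedged sketch: the softmax probe and plain mini-batch gradient descent below stand in for the authors' implementation, which additionally applies the elastic-net penalties tuned by grid search and uses the Adam optimizer.

```python
import numpy as np

def train_probe_minibatch(X, y, lr=0.001, epochs=10, batch=512, seed=0):
    """Mini-batch probe training with the paper's reported settings:
    learning rate 0.001, 10 epochs, shuffled mini-batches of size 512.
    The unregularized softmax probe and plain SGD are stand-ins."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = int(y.max()) + 1
    W, b = np.zeros((d, k)), np.zeros(k)
    for _ in range(epochs):
        order = rng.permutation(n)                     # reshuffle each epoch
        for start in range(0, n, batch):
            idx = order[start:start + batch]
            Yb = np.eye(k)[y[idx]]                     # one-hot batch labels
            logits = X[idx] @ W + b
            logits -= logits.max(axis=1, keepdims=True)
            P = np.exp(logits)
            P /= P.sum(axis=1, keepdims=True)          # softmax probabilities
            W -= lr * X[idx].T @ (P - Yb) / len(idx)
            b -= lr * (P - Yb).mean(axis=0)
    return W, b
```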