Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Finding Neurons in a Haystack: Case Studies with Sparse Probing

Authors: Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas

TMLR 2023

Reproducibility variables, results, and supporting LLM responses:

Research Type: Experimental
"In this work, we seek to understand how high-level human-interpretable features are represented within the internal neuron activations of LLMs. We train k-sparse linear classifiers (probes) on these internal activations to predict the presence of features in the input; by varying the value of k we study the sparsity of learned representations and how this varies with model scale. ... In all, we probe for over 100 unique features comprising 10 different categories in 7 different models spanning 70 million to 6.9 billion parameters. ... We conduct a series of more detailed case studies to carefully study the behavior of individual neurons, while also illustrating the challenges that pose barriers to further progress."

Researcher Affiliation: Academia
Wes Gurnee (Massachusetts Institute of Technology), Neel Nanda (Independent), Matthew Pauly (Harvard University), Katherine Harvey (Harvard University), Dmitrii Troitskii (Northeastern University), Dimitris Bertsimas (Massachusetts Institute of Technology).

Pseudocode: No
The paper includes mathematical equations and descriptions of methods but does not contain any clearly labeled pseudocode or algorithm blocks.

Open Source Code: Yes
"All code and data are available at https://github.com/wesg52/sparse-probing-paper."

Open Datasets: Yes
"We study ten different feature collections: the natural language of Europarl documents, the programming language of Github source files, the data source of documents from The Pile... Full descriptions of datasets, their construction, and summary statistics are available in B.2. ... B.2 Datasets: We used both the raw text and the labels from the Euro Parl subset of the pile... We then used a code recognition package to classify the type of code... For our linguistic features, we use the text and labels from the well known Penn Treebank Corpus."

Dataset Splits: No
"To evaluate the performance of a probe, we compute the number of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) of our binary classifier on an out-of-sample test set." While the paper mentions an out-of-sample test set, it does not provide specific percentages, sample counts, or citations to predefined train/validation/test splits for the datasets used in the experiments. It describes how some probing datasets were constructed with specific numbers of positive and negative examples, but not how those examples were split between training and testing.

Hardware Specification: No
The paper does not describe the hardware used to run the experiments (e.g., specific GPU or CPU models, or memory details). It mentions only the Pythia models (Pythia 70M to Pythia 6.9B) and their architectural parameters.

Software Dependencies: No
The paper mentions using "Eleuther AI's Pythia suite of autoregressive transformer language models (Biderman et al., 2023)", provides a GitHub link to it, and notes that "our experiments were performed with the V0 suite of models". However, it does not give version numbers for ancillary software such as the programming language (e.g., Python) or libraries (e.g., PyTorch).

Experiment Setup: No
The paper describes the probes used (k-sparse linear classifiers, logistic regression with elasticnet regularization, and a cardinality-constrained SVM with hinge loss), mentions that balanced class weights were used and that hyperparameters were selected, and notes a 60-second timeout for Optimal Sparse Probing. However, it does not report the specific numerical values chosen, such as learning rates, batch sizes, number of epochs, or the elasticnet regularization coefficients.
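As context for the "Research Type" and "Dataset Splits" entries above, the following is a minimal sketch of what k-sparse probing with TP/FP/FN-based scoring looks like. This is not the authors' implementation (their code is in the linked repository); the synthetic activations, the mean-difference neuron-selection heuristic, and every hyperparameter below are illustrative assumptions.

```python
# Sketch of k-sparse probing (illustrative, not the paper's code): pick the k
# neurons whose mean activation differs most between classes, fit a logistic
# probe on just those neurons, and score it via TP/FP/FN counts.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "activations": 2000 examples x 64 neurons. Neurons 3 and 17 carry
# the feature signal; the rest are noise (an assumption for this demo).
n, d, k = 2000, 64, 2
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, 3] += 1.5 * y
X[:, 17] -= 1.2 * y

train, test = slice(0, 1500), slice(1500, n)

# 1. Rank neurons by absolute class-mean difference; keep the top k.
diff = X[train][y[train] == 1].mean(0) - X[train][y[train] == 0].mean(0)
selected = np.argsort(-np.abs(diff))[:k]

# 2. Fit a logistic probe on the k selected neurons (plain gradient descent).
Z = X[train][:, selected]
w, b = np.zeros(k), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))  # sigmoid
    g = p - y[train]                        # gradient of log-loss
    w -= 0.1 * (Z.T @ g) / len(g)
    b -= 0.1 * g.mean()

# 3. Evaluate on the held-out slice with TP/FP/FN, as in F1-style metrics.
pred = (X[test][:, selected] @ w + b) > 0
tp = int(np.sum(pred & (y[test] == 1)))
fp = int(np.sum(pred & (y[test] == 0)))
fn = int(np.sum(~pred & (y[test] == 1)))
f1 = 2 * tp / (2 * tp + fp + fn)
print(sorted(selected.tolist()), round(f1, 2))
```

The selection step here is a simple filter heuristic; the paper's probes instead use elasticnet-regularized logistic regression and cardinality-constrained SVMs to induce sparsity, but the overall shape (choose k neurons, fit a linear probe, score out of sample) is the same.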