Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework
Authors: Laura Kopf, Nils Feldhus, Kirill Bykov, Philine L Bommer, Anna Hedström, Marina Höhne, Oliver Eberle
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply PRISM to LLMs and, through extensive benchmarking against existing methods, demonstrate that our approach produces more accurate and faithful feature descriptions, improving both overall description quality (via a description score) and the ability to capture distinct concepts when polysemanticity is present (via a polysemanticity score). Section 4, Quantitative Evaluation: In the following, we quantitatively evaluate our proposed feature description method, PRISM, against existing approaches for neuron and SAE feature interpretation. Section 4.1, Experimental Setup: In our experiments, we evaluate PRISM against competitive feature description methods... |
| Researcher Affiliation | Academia | (1) Technische Universität Berlin, Germany; (2) BIFOLD, Germany; (3) UMI Lab, ATB Potsdam, Germany; (4) Fraunhofer Heinrich-Hertz-Institute, Germany; (5) ETH AI Center, Switzerland; (6) Universität Potsdam, Germany; (7) Munich Center for Machine Learning (MCML); (8) Technische Universität München |
| Pseudocode | No | No specific pseudocode or algorithm block is explicitly labeled or formatted as such. Section 3.1 describes the framework steps in prose: "Our multi-concept framework consists of the following steps in reference to Figure 2: 1. Percentile Sampling. 2. Concept Clustering. 3. Cluster Labeling." |
| Open Source Code | Yes | Our code is made publicly available to the community. Footnote 1: https://github.com/lkopf/prism |
| Open Datasets | Yes | Unless otherwise specified, all experiments are conducted on the English training subset of the C4 corpus [41], a large, cleaned version of Common Crawl's web crawl corpus. As the control dataset $X_0$, we use a subset of 1,000 randomly sampled entries from Cosmopedia [40]. |
| Dataset Splits | Yes | Given the corresponding activations $A_0 \in \mathbb{R}^n$ for $X_0$ and $A_1 \in \mathbb{R}^m$ for $X_1$, the AUROC is computed as... The Mean Activation Difference (MAD) quantifies the normalized difference between the mean activation on the target and control datasets: ... For each candidate description of a target feature, we use Gemini 1.5 Pro [47] to generate 10 concept-specific text samples, each with a maximum length of 512 tokens. These samples form the concept dataset $X_1$. The generation prompt is shown in Figure 8. We then pass both datasets through the model to extract activations corresponding to the target feature. ...control dataset $X_0$, consisting of 1,000 randomly sampled entries from Cosmopedia [40]. |
| Hardware Specification | Yes | Compute Resources: All experiments were conducted using a single NVIDIA A100 80GB GPU. |
| Software Dependencies | No | The paper mentions specific LLM models (e.g., "GPT-2 XL [43]", "Llama 3.1 8B Instruct [44]", "Gemini 1.5 Pro [47]"), sentence embedders ("gte-Qwen2-1.5B-instruct sentence transformer [46]"), and methods ("K-Means method [38]", "P² algorithm [37]"), but it does not specify the versions of general software components or libraries such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For (1) Percentile Sampling, we identify all text excerpts whose mean activation values fall within the 99th-100th percentile, sampling one excerpt per percentile bin with a step size of 1e-05, resulting in 1000 high-activation excerpts per feature. For (2) Concept Clustering, the resulting text set is embedded using the gte-Qwen2-1.5B-instruct sentence transformer [46], and then k-means clustering is applied with $k = 5$ to uncover recurring conceptual patterns. For (3) Cluster Labeling, ... we prompt a large language model (Gemini 1.5 Pro [47]) using the $N_s = 20$ text excerpts with the highest mean activations for each cluster. Additional prompt details are provided in Appendix A.4. |
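
Two numeric steps quoted in the table can be illustrated with a minimal sketch in plain Python: the percentile-bin sampling from the Experiment Setup row and the AUROC and MAD scores from the Dataset Splits row. This is not the authors' implementation. The function names are hypothetical, the sampling uses an exact sort rather than a streaming percentile estimator, ties in the AUROC count as one half, and the MAD normalizer is assumed to be the population standard deviation of the control activations, since the quoted text only says the difference is "normalized".

```python
import statistics


def percentile_sample(mean_acts, start_bin=99_000, total_bins=100_000):
    """Pick one excerpt index per percentile bin in the top 1% of activations.

    Bins of width 1/total_bins (1e-05 by default) from start_bin/total_bins
    (0.99) up to 1.0 give 1000 bins. Assumption: an exact sort is used here
    for clarity, not a streaming percentile estimator.
    """
    n = len(mean_acts)
    if n == 0:
        return []
    order = sorted(range(n), key=lambda i: mean_acts[i])  # ascending activation
    picked, seen = [], set()
    for b in range(start_bin, total_bins):
        # Integer arithmetic avoids floating-point drift in the bin edges.
        idx = order[min(n - 1, b * n // total_bins)]
        if idx not in seen:  # small corpora map many bins to the same excerpt
            seen.add(idx)
            picked.append(idx)
    return picked


def auroc(control, concept):
    """Probability that a concept activation exceeds a control activation."""
    wins = 0.0
    for a1 in concept:
        for a0 in control:
            if a1 > a0:
                wins += 1.0
            elif a1 == a0:
                wins += 0.5  # ties count as half a win
    return wins / (len(concept) * len(control))


def mad(control, concept):
    """Mean activation difference, scaled by the control spread.

    Assumption: normalization by the population standard deviation of the
    control activations; the quoted text does not state the normalizer.
    """
    sigma0 = statistics.pstdev(control)
    if sigma0 == 0:
        raise ValueError("control activations have zero variance")
    return (statistics.fmean(concept) - statistics.fmean(control)) / sigma0


if __name__ == "__main__":
    control_acts = [0.0, 2.0]   # stands in for A_0 (control dataset X_0)
    concept_acts = [3.0, 5.0]   # stands in for A_1 (concept dataset X_1)
    print(auroc(control_acts, concept_acts), mad(control_acts, concept_acts))
```

With the quoted setup, `control` would hold the 1,000 control activations and `concept` the 10 activations from the generated samples, so the quadratic loop in `auroc` stays cheap. A rank-based AUROC would be the better choice for much larger inputs.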