Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Concept-Guided Interpretability via Neural Chunking
Authors: Shuchen Wu, Stephan Alaniz, Shyamgopal Karthik, Peter Dayan, Eric Schulz, Zeynep Akata
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of these methods in extracting concept-encoding entities agnostic to model architectures. These concepts can be both concrete (words), abstract (POS tags), or structural (narrative schema). Additionally, we show that extracted chunks play a causal role in network behavior, as grafting them leads to controlled and predictable changes in the model's behavior. 4.1 Evaluating Reflection Hypothesis on Simple RNNs |
| Researcher Affiliation | Academia | 1 Allen Institute; 2 University of Washington; 3 Télécom Paris, Institut Polytechnique de Paris; 4 Institute of Explainable Machine Learning, Helmholtz Munich; 5 Department of Computational Neuroscience, Max Planck Institute for Biological Cybernetics; 6 Institute for Human-Centered AI, Helmholtz Munich |
| Pseudocode | Yes | C Pseudocode of Learning Chunks Using Discrete Sequence Chunking Algorithm 1: Learn Chunks Algorithm 2: Neural Population State Chunking |
| Open Source Code | Yes | Implementation and code are publicly available at https://github.com/swu32/Chunk-Interpretability. |
| Open Datasets | Yes | We started with LLaMA3-8B [22]... We generalized our method for identifying concept-encoding neural population chunks to other large-scale models with distinct architectures and sequence-processing mechanisms, including the encoder-decoder model T5 (t5-small [66]), the RNN-based RWKV (rwkv-4-169m-pile) [70], and the state-space model Mamba (mamba-130m-hf [38])... ROCStories benchmark [62]... TREC Question Classification dataset [53]... Emma by Jane Austen, sourced from the Project Gutenberg corpus [40], accessed via NLTK [11]... Penn Treebank POS Tagset [55]. |
| Dataset Splits | Yes | We trained two identical RNNs on synthetic training sequences... We then further train the two RNNs in a transfer sequence... We assessed the extracted chunks and thresholds by measuring how well chunk detection predicted the occurrence of the word concept s... Similar to PA, we selected the activation thresholds of concept encoding neurons from training data and applied them to test data... by training on 20 stories following a schema (e.g., visit food location buy item eat react) and 13 control stories with narrative structure inconsistent with the schema (examples shown in Figure 6). We extracted shared subpopulation chunks at the end-of-sentence token using PA, and tested on 18 new schema-consistent and 15 inconsistent stories. |
| Hardware Specification | Yes | All experiments were conducted on a shared internal cluster equipped with NVIDIA Quadro RTX 6000. On a single NVIDIA RTX 3090, training for 1 epoch takes 30-60 seconds, and 100 epochs takes about 1-2 hours. All layers will take 32-64 hours if done sequentially. |
| Software Dependencies | No | The RNN is optimized with cross-entropy loss using Adam (learning rate = 0.005)... The model was trained for 160 iterations using Adam (lr = 0.005)... To achieve this, we extracted the POS tags for the corpus using the averaged perceptron tagger [11], following the Penn Treebank POS Tagset [55]. |
| Experiment Setup | Yes | The RNN is optimized with cross-entropy loss using Adam (learning rate = 0.005). Training is conducted on random subsequences of length 200 per batch, with the hidden state initialized to zero at the start of training. Since the neural subpopulation C(s), the chunk h_C(s), and the deviation threshold depend on a tolerance parameter, we generate a series of increasingly stringent tolerance thresholds: tol_i = 2 × 0.8^i, i = 0, 1, ..., 39. To this end, we train a chunk dictionary D on LLaMA3's hidden activity... then trained D (K = 2000, d = 4096) by minimizing the similarity loss function formulated in Section 3.1 individually for each layer of LLaMA-3. |
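
The tolerance schedule quoted in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the extracted expression reads as tol_i = 2 × 0.8^i for i = 0, ..., 39, and the function name `tolerance_schedule` and its parameter names are hypothetical.

```python
def tolerance_schedule(base=2.0, decay=0.8, steps=40):
    """Return the list [base * decay**i for i = 0 .. steps-1].

    Assumed reading of the quoted schedule: tol_i = 2 * 0.8**i, i = 0..39.
    The names `base`, `decay`, and `steps` are illustrative only.
    """
    return [base * decay ** i for i in range(steps)]


# Build the 40 thresholds; each one is tighter (smaller) than the last.
tols = tolerance_schedule()
print(len(tols), tols[0], tols[-1])
```

Under this reading the thresholds shrink geometrically from 2 toward a value well below 0.001, which matches the description of "increasingly stringent" tolerances.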