Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

SPEX: Scaling Feature Interaction Explanations for LLMs

Authors: Justin Singh Kang, Landon Butler, Abhineet Agarwal, Yigit Efe Erginbas, Ramtin Pedarsani, Bin Yu, Kannan Ramchandran

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "We perform experiments across three difficult long-context datasets that require LLMs to utilize interactions between inputs to complete the task. For large inputs, SPEX outperforms marginal attribution methods by up to 20% in terms of faithfully reconstructing LLM outputs. Further, SPEX successfully identifies key features and interactions that strongly influence model output."
Researcher Affiliation: Academia. "Department of Electrical Engineering and Computer Science, UC Berkeley; Department of Statistics, UC Berkeley; Department of Electrical and Computer Engineering, UC Santa Barbara. Correspondence to: Justin Singh Kang <justin EMAIL>."
Pseudocode: Yes. Appendix A (Algorithm Details) lists: Algorithm 1, Collect Samples; Algorithm 2, BCH Hard Decode; Algorithm 3, BCH Soft Decode (Chase Decoding); Algorithm 4, Message Passing.
Open Source Code: Yes. https://github.com/basics-lab/spectral-explain
Open Datasets: Yes.
1. Sentiment is primarily composed of the Large Movie Review Dataset (Maas et al., 2011), which contains both positive and negative IMDb movie reviews. The dataset is augmented with examples from the SST dataset (Socher et al., 2013) to ensure coverage for small n.
2. Hotpot QA (Yang et al., 2018) is a question-answering dataset requiring multi-hop reasoning over multiple Wikipedia articles to answer complex questions.
3. Discrete Reasoning Over Paragraphs (DROP) (Dua et al., 2019) is a comprehension benchmark requiring discrete reasoning operations like addition, counting, and sorting over paragraph-level content to answer questions.
Dataset Splits: No. The paper mentions data grouping for evaluation (e.g., "160 reviews were categorized using their word counts into 8 groups...") and "random test masks" for faithfulness, but it does not specify explicit train/test/validation splits for the main models (Llama-3.2-3B-Instruct, DistilBERT) or for SPEX itself. For baselines, it mentions 5-fold cross-validation for regularization-parameter selection, but not for the primary experimental setup.
Hardware Specification: Yes. "Experiments are run on a server using Nvidia L40S GPUs and A100 GPUs."
Software Dependencies: No. "We make use of the default word and sentence tokenizer from nltk (Bird et al., 2009). To fit regressions, we use the scikit-learn (Pedregosa et al., 2011) implementations of Linear Regression and Ridge CV. We use the software package galois (Hostetter, 2020) to construct a generator matrix..." No version numbers for nltk, scikit-learn, or galois are provided.
Experiment Setup: Yes.
Hyperparameters: SPEX has several parameters that determine the number of model inferences (masks). We choose C = 3, informed by Li et al. (2014) under a simplified sparse Fourier setting. We fix t = 5, the error-correction capability of SPEX, which also serves as an approximate bound on the maximum interaction degree. We also set b = 8; the total number of collected samples is C · 2^b · t · log(n). For ℓ1-regression-based interaction indices, the regularization parameter is chosen via 5-fold cross-validation.
Models: For DROP and Hotpot QA (generative question-answering tasks), we use Llama-3.2-3B-Instruct (Grattafiori et al., 2024) with 8-bit quantization. For Sentiment (classification), we use an encoder-only fine-tuned DistilBERT model (Sanh et al., 2019; Odabasi, 2025).
Masking and outputs: When masking, each word is replaced with the [UNK] token. In sentiment analysis, f(x_S) is the logit of the positive class. For text-generation tasks, we use the well-established practice of scalarizing generated text via the negative log-perplexity.
Prompt format for Hotpot QA: "Title: {title 1} Content: {document 1} ... Query: {question}". Prompt format for DROP: "Context: {context} Query: {question}". Simplified trolley problem: "System: Answer with the one word True or False only. User: {Masked Input} True or False: You should not pull the lever." For VQA, a Gaussian blur was applied to the masked cells.
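The sample budget above can be sketched numerically. This is a minimal illustration of the reported formula C · 2^b · t · log(n) with the paper's settings (C = 3, b = 8, t = 5); the function name `spex_num_samples` and the use of the natural logarithm are assumptions, since the log base is not stated in the excerpt.

```python
import math

def spex_num_samples(n, C=3, b=8, t=5):
    """Estimated total masked-model inferences: C * 2^b * t * log(n).

    Assumption: natural log; the excerpt does not specify the base.
    """
    return math.ceil(C * (2 ** b) * t * math.log(n))

# e.g. for an input with n = 1000 maskable features:
budget = spex_num_samples(1000)
```

Because the 2^b and t factors are fixed, the budget grows only logarithmically in the number of features n, which is what makes the method practical for long-context inputs.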