Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits

Authors: Areeb Ahmad, Abhinav Joshi, Ashutosh Modi

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We validate our perspective on widely used standard tasks like Indirect Object Identification (IOI), Gender Pronoun (GP), and Greater Than (GT), showing that previously identified canonical functional heads, such as the name mover, encode multiple overlapping subfunctions aligned with distinct singular directions. ... We apply our method to a pretrained GPT-2 Small model [Radford et al., 2019], a tractable benchmark widely used in mechanistic interpretability [Wang et al., 2022, Hanna et al., 2023]. To inspect the generality of directional subfunctions, we evaluate the model on three representative tasks: Indirect Object Identification (IOI) [Wang et al., 2022], Gender Pronoun (GP) [Mathwin et al., 2023], and Greater Than (GT) [Hanna et al., 2023].
Researcher Affiliation	Academia	Areeb Ahmad, Abhinav Joshi, Ashutosh Modi. Indian Institute of Technology Kanpur (IIT Kanpur).
Pseudocode	Yes	Algorithm 1 Directional Mask Optimization via Singular Value Decomposition Require: Pretrained model fθ, dataset D = {(x, y)}, sparsity coefficient λ,
Open Source Code	Yes	We release our codebase for the experiments and additional results at https://github.com/Exploration-Lab/Beyond-Components.
Open Datasets	Yes	We apply our method to a pretrained GPT-2 Small model [Radford et al., 2019], a tractable benchmark widely used in mechanistic interpretability [Wang et al., 2022, Hanna et al., 2023]. To inspect the generality of directional subfunctions, we evaluate the model on three representative tasks: Indirect Object Identification (IOI) [Wang et al., 2022], Gender Pronoun (GP) [Mathwin et al., 2023], and Greater Than (GT) [Hanna et al., 2023]. Full dataset details are provided in App. A.
Dataset Splits	Yes	Table 2: Dataset splits for the three tasks used in our experiments, IOI (Indirect Object Identification), GT (Greater Than), and GP (Gender Pronouns), indicating the number of examples allocated to the training, validation, and test sets. Task: Train / Validation / Test. IOI: 1k / 200 / 1k; GT: 2k / 500 / 2k; GP: 1k / 155 / 307.
Hardware Specification	Yes	All experiments were conducted on a single NVIDIA A40 GPU with 48 GB of VRAM.
Software Dependencies	Yes	Our implementation is based on PyTorch 2.1 [Paszke et al., 2019] and the Hugging Face Transformers library (v4.35), leveraging its integration with pre-trained models and tokenizer utilities. We utilize the GPT-2 small model (124M parameters), with all weights frozen during our experiments to ensure the integrity of the underlying representations. For systematic access to internal model activations and component-wise analysis, we employ the TransformerLens library [Nanda and Bloom, 2022], which provides fine-grained control over the transformer's intermediate computations, enabling singular value decomposition and mask-based interventions at the component level.
Experiment Setup	Yes	Table 3: Hyperparameters used for training the linear probes and other learned components in our experiments. We report batch size, number of epochs, optimization parameters (learning rate and weight decay), and the coefficient of the L1 regularization term. Hyperparameter: Value. Batch Size: 64; Number of Epochs: 15; Learning Rate: 1.0 × 10^-2; Weight Decay: 1.0 × 10^-9; L1 Regularization Weight: 1.5 × 10^-4.
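
The table mentions singular value decomposition, mask-based interventions, and an L1 regularization term. The sketch below is only an illustration of that general idea, not the authors' Algorithm 1: it decomposes a small random matrix (a hypothetical stand-in for a frozen attention-head weight), scales each singular direction by a mask entry, and computes an L1 penalty on the mask. It uses NumPy rather than PyTorch for brevity. The names `masked_reconstruction` and `l1_penalty`, the matrix shape, and the reading of the L1 coefficient as 1.5e-4 are all assumptions.

```python
import numpy as np

# Assumed reading of the "L1 Regularization Weight" hyperparameter.
L1_WEIGHT = 1.5e-4


def masked_reconstruction(weight, mask):
    """Rebuild `weight` from its SVD, scaling each singular direction by `mask`."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    # Broadcasting scales column i of `u` by s[i] * mask[i].
    return (u * (s * mask)) @ vt


def l1_penalty(mask, coeff=L1_WEIGHT):
    """Sparsity term that pushes mask entries toward zero."""
    return coeff * np.abs(mask).sum()


rng = np.random.default_rng(0)
w = rng.standard_normal((8, 4))          # hypothetical stand-in for a head's weight matrix
keep_all = np.ones(4)                    # every singular direction retained
keep_top = np.array([1.0, 0.0, 0.0, 0.0])  # only the leading direction retained

w_all = masked_reconstruction(w, keep_all)
w_top = masked_reconstruction(w, keep_top)
```

With an all-ones mask the reconstruction matches the original matrix up to floating-point error, while keeping a single direction yields a rank-1 approximation. In a trainable setting the mask entries would be learned parameters and the penalty would be added to the task loss.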