Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Pinpointing Attention-Causal Communication in Language Models

Authors: Gabriel Franco, Mark Crovella

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We utilize models spanning various architectures and scales: GPT-2 small [19], Pythia-160M [58], and Gemma-2 2B [55]. These models are evaluated on multiple tasks: Indirect Object Identification (IOI) [45], the Greater Than (GT) task [59], and Gender Pronoun (GP) [60]. These models and tasks cover a wide range of studies in mechanistic interpretability, specifically in circuit discovery [14, 61–63]. Figure 1 illustrates the resulting distributions of the fraction of singular vectors used to construct Sℓads, i.e., the distribution of |Sℓads|/R. Figure 2 shows typical results; full results are in Appendix J. We find that ablating any of the test signals leads to performance decreases, and boosting the signal leads to performance improvements, across the three models and three tasks.
Researcher Affiliation	Academia	Gabriel Franco Department of Computer Science Boston University EMAIL Mark Crovella Department of Computer Science and Faculty of Computing & Data Sciences Boston University EMAIL
Pseudocode	Yes	Algorithm 1 outlines the pseudocode for the complete process of constructing communication graphs.
Open Source Code	Yes	We emphasize that all the methods in the paper extend to models having attention bias terms and using RoPE, and we provide code implementing our methods for those models. Code available at https://github.com/gaabrielfranco/pinpointing-attention-causal-communication
Open Datasets	Yes	We utilize models spanning various architectures and scales: GPT-2 small [19], Pythia-160M [58], and Gemma-2 2B [55]. These models are evaluated on multiple tasks: Indirect Object Identification (IOI) [45], the Greater Than (GT) task [59], and Gender Pronoun (GP) [60].
Dataset Splits	No	We used the authors' code to generate 256 prompts for this task, using a mix of the ABBA and BABA templates. ... We used the authors' code with the 100 provided examples. ... We utilized the authors' provided code to generate 256 prompts for this specific task. This describes prompt generation and selection for evaluation, not dataset splits for training, validation, or testing.
Hardware Specification	Yes	On CPU hardware (machines with 28 cores), tracing a 22-token prompt (the largest prompt size across all tasks that we used) takes approximately one minute for GPT-2 and for Pythia, and about one hour for Gemma-2.
Software Dependencies	No	Code for our method is available at https://github.com/gaabrielfranco/pinpointing-attention-causal-communication; its implementation was enabled by the Transformer Lens library [64].
Experiment Setup	Yes	The parameter β is set based on the degree to which the less important edges should be filtered from the communication graph. In practice, we use β = 0.7 in all our results, which filters most of the low-weight edges while preserving the largest-weight edges.
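
The Experiment Setup row says that β controls how aggressively low-weight edges are filtered from the communication graph, but the excerpt does not state the exact rule. The sketch below illustrates one possible reading only: keep an edge if its weight is at least β times the largest edge weight. That rule, the function name `filter_edges`, the edge representation, and the toy weights are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
from typing import Dict, Tuple

# Hypothetical edge representation: (source node, destination node).
Edge = Tuple[str, str]


def filter_edges(weights: Dict[Edge, float], beta: float = 0.7) -> Dict[Edge, float]:
    """Keep edges whose weight is at least beta * (largest weight).

    ASSUMPTION: this relative-threshold rule is illustrative only; the
    paper's actual use of beta may differ. Weights are assumed non-negative.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta is expected to lie in [0, 1]")
    if not weights:
        return {}
    cutoff = beta * max(weights.values())
    return {edge: w for edge, w in weights.items() if w >= cutoff}


# Toy graph with made-up weights, not data from the paper.
toy_graph = {
    ("a", "b"): 1.0,
    ("a", "c"): 0.8,
    ("b", "c"): 0.3,
    ("c", "d"): 0.05,
}

kept = filter_edges(toy_graph, beta=0.7)
print(sorted(kept))
```

Under this assumed rule, raising β toward 1 keeps only edges close to the maximum weight, while β = 0 keeps every edge. That matches the qualitative behavior described in the row above, but it does not establish that the paper uses this rule.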