Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Causal Discovery and Inference through Next-Token Prediction

Authors: Eivinas Butkus, Nikolaus Kriegeskorte

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Here we demonstrate that a GPT-style transformer trained for next-token prediction can simultaneously discover instances of linear Gaussian structural causal models (SCMs) and learn to answer counterfactual queries about those SCMs. First, we show that the network generalizes to counterfactual queries about SCMs for which it has seen interventional data but not any examples of counterfactual inference. The network must, thus, have successfully composed discovered causal structures with a learned counterfactual inference algorithm. Second, we decode the implicit mental SCM from the network's residual stream activations and manipulate it using gradient descent with predictable effects on the network's output. Our results suggest that statistical prediction may be sufficient to drive the emergence of internal causal models and causal inference capacities in deep neural networks.
Researcher Affiliation	Academia	Eivinas Butkus (1,2), Nikolaus Kriegeskorte (1,2); EMAIL, EMAIL; (1) Columbia University, (2) NSF AI Institute for Artificial and Natural Intelligence
Pseudocode	Yes	Algorithm 1 Gradient-Based Residual Stream Intervention
Open Source Code	Yes	https://github.com/eivinasbutkus/causal-discovery-and-inference-through-next-token-prediction
Open Datasets	No	We use the structural causal model (SCM) formalism [35] to synthesize our training data. Prior work has shown that transformers trained on synthetic data can uncover hierarchical or compositional structure. For instance, Murty et al. [29] show that transformers can learn hierarchical syntactic rules through extended training, while Lake and Baroni [19] demonstrate systematic generalization to novel combinations through meta-learning on algebraic reasoning tasks with compositional structure. SCMs are distinct in that they encode mechanistic causal relationships that support interventional (L2) and counterfactual (L3) reasoning within Pearl's Causal Hierarchy [2]. Training on SCM-generated data allows us to put Pearl's theoretical claims about the limitations of deep neural networks [38, 36, 37] to a direct test.
Dataset Splits	Yes	For this purpose, we devised a generalization challenge by randomly choosing a held-out set of 1,000 SCMs (denoted D_test) for which the model only saw DATA strings during training (Fig. 4). If the trained model can answer counterfactual queries about this test set, it means that it has (1) learned a more general counterfactual inference engine, (2) built a shared representation for interventional data and counterfactual inference, and (3) discovered the causal structure of SCMs within the D_test set from interventional data strings. In other words, it can compose the learned counterfactual inference engine from D_train strings with the discovered causal structure from interventional data strings in D_test. ... We split D_train set (Fig. 4) into D_probe-train (57,049 SCMs) and D_probe-valid (1,000 SCMs) sets.
Hardware Specification	Yes	Our model requires approximately 3GB VRAM and can be trained on consumer-grade hardware. Each epoch processes approximately 1.2 million examples in 10 minutes. We used one NVIDIA L40 GPU to train the final models within a university cluster using PyTorch 2.3. Evaluations, probe training, and interventional analyses were performed using a desktop machine with NVIDIA GeForce RTX 2080 Ti (10GB VRAM).
Software Dependencies	Yes	We used one NVIDIA L40 GPU to train the final models within a university cluster using PyTorch 2.3.
Experiment Setup	Yes	Our transformer model has 12 layers, hidden size 512, 8 attention heads of size 64, MLP size 2048, GELU activation function, and Pre-LN type layer normalization. We use AdamW [25] with learning rate 10^-5, betas [0.9, 0.999], eps 10^-8, and 0.001 weight decay. We set batch size to 128 and train for different number of epochs depending on whether the variable naming scheme is fixed or shuffled (see Section 3.2). When variable names are fixed, we train for 300 epochs, reducing learning rate to 10^-6 for the last 10 epochs. When variable names are shuffled, the model takes longer to converge, so we train for 1,500 epochs, reducing learning rate to 10^-6 for the last 100 epochs.
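
The learning-rate schedule and training cost reported in the Experiment Setup and Hardware Specification rows can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' training code: the names (SCHEDULES, learning_rate, steps_per_epoch, estimated_hours) are hypothetical, epochs are assumed to be 0-indexed, the drop to 10^-6 is assumed to be a single step, and the "approximately 10 minutes per epoch" figure is assumed to hold for both naming schemes.

```python
import math

# Values quoted in the table above; all names here are hypothetical.
BASE_LR = 1e-5                  # AdamW learning rate for most of training
FINAL_LR = 1e-6                 # reduced rate for the final epochs
BATCH_SIZE = 128
EXAMPLES_PER_EPOCH = 1_200_000  # "approximately 1.2 million examples"
MINUTES_PER_EPOCH = 10          # approximate; assumed equal for both schemes

SCHEDULES = {
    "fixed": {"epochs": 300, "final_epochs": 10},
    "shuffled": {"epochs": 1500, "final_epochs": 100},
}


def learning_rate(epoch: int, naming: str) -> float:
    """Learning rate for a 0-indexed epoch (assumes a single step drop)."""
    cfg = SCHEDULES[naming]
    if not 0 <= epoch < cfg["epochs"]:
        raise ValueError(f"epoch {epoch} outside 0..{cfg['epochs'] - 1}")
    first_reduced_epoch = cfg["epochs"] - cfg["final_epochs"]
    return FINAL_LR if epoch >= first_reduced_epoch else BASE_LR


def steps_per_epoch() -> int:
    """Optimizer steps per epoch at the reported batch size."""
    return math.ceil(EXAMPLES_PER_EPOCH / BATCH_SIZE)


def estimated_hours(naming: str) -> float:
    """Rough wall-clock estimate from the reported per-epoch time."""
    return SCHEDULES[naming]["epochs"] * MINUTES_PER_EPOCH / 60


if __name__ == "__main__":
    for naming in SCHEDULES:
        print(naming, steps_per_epoch(), estimated_hours(naming))
```

Under these assumptions, the fixed-naming run comes to roughly 50 hours and the shuffled-naming run to roughly 250 hours on the reported hardware. Both figures are estimates derived from the approximate per-epoch time, not measurements reported by the authors.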