Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Counterfactual reasoning: an analysis of in-context emergence

Authors: Moritz Miller, Bernhard Schölkopf, Siyuan Guo

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show that in-context counterfactual reasoning emerges in Transformers in form of a designated noise abduction head (Section 4.4). Causal probing reveals that the residual stream linearly encodes the latent θ (Section 4.2). We find that data diversity in pre-training, self-attention and model depth are key for Transformers performance (Sections 4.2 and 4.3). More interestingly, our findings transfer to cyclic sequential data (Section 5), demonstrating concrete preliminary evidence that language models can perform counterfactual story generation in sequential data.
Researcher Affiliation | Academia | Moritz Miller (1, 2), Bernhard Schölkopf (1, 2), Siyuan Guo (1, 3); (1) Max Planck Institute for Intelligent Systems, (2) ETH Zurich, (3) University of Cambridge
Pseudocode | No | The paper includes mathematical equations and descriptions of various processes but does not present any explicitly labeled pseudocode or algorithm blocks with structured, code-like steps.
Open Source Code | Yes | Our code is available under https://github.com/mrtzmllr/iccr.
Open Datasets | No | We use the counterfactual framework of Pearl [2009] and study a controlled synthetic setup similar to Garg et al. [2022]. That is, let $y = f(x, u_y)$ for some function $f \in F$, for function class $F$. The model predicts target $y^{CF}$ given counterfactual query $x^{CF}$ conditioned on a prompt sequence $(x_1, y_1, \dots, x_k, y_k, z, x^{CF})$ where $z$ is an index token indicating the position of the factual observation that such counterfactual query is based on.
Dataset Splits | Yes | We evaluate each model on unseen sequences sampled in-distribution and report results averaged over 6400 sequences. Our synthetic generation yields $E[Y^{CF}] = 0$ and $\log \mathrm{std}(Y^{CF}) = 2.56$. Appendices C and D include data generation particularities as well as experimental and model details.
Hardware Specification | Yes | We implement the experiments in pytorch [Paszke et al., 2019] and use one NVIDIA GeForce RTX 3090 GPU for training.
Software Dependencies | No | We implement the experiments in pytorch [Paszke et al., 2019] and use one NVIDIA GeForce RTX 3090 GPU for training. The paper mentions PyTorch and torchsde, but does not provide specific version numbers for these software components.
Experiment Setup | Yes | Our code is based on the repository by Garg et al. [2022]. We therefore adopt the learning rate of $10^{-4}$ for all function classes and models but use the AdamW optimizer [Loshchilov and Hutter, 2019] instead of Adam [Kingma and Ba, 2015]. We implement the experiments in pytorch [Paszke et al., 2019] and use one NVIDIA GeForce RTX 3090 GPU for training. The conducted experiments require between 5 minutes and 2 hours of training depending on model setup and task complexity.
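
The synthetic setup quoted in the Open Datasets row can be illustrated with a small Python sketch. It is not taken from the paper's repository. It assumes, purely for illustration, a linear additive function f(x, u) = θ·x + u with Gaussian inputs and noise; the function names `sample_prompt` and `counterfactual_target` and the parameter `noise_std` are hypothetical. The sketch builds one prompt (x_1, y_1, ..., x_k, y_k, z, x_CF) and computes the counterfactual target by abducting the noise of the factual observation at index z and reusing it at the query x_CF.

```python
import random


def counterfactual_target(theta, x_fact, y_fact, x_cf):
    """Counterfactual outcome under the ASSUMED model y = theta * x + u."""
    # Abduction: recover the noise behind the factual observation.
    u_fact = y_fact - theta * x_fact
    # Action and prediction: apply the same noise at the counterfactual query.
    return theta * x_cf + u_fact


def sample_prompt(k, theta, rng, noise_std=1.0):
    """Sample one prompt (x_1, y_1, ..., x_k, y_k, z, x_cf) and its target.

    The Gaussian inputs, Gaussian noise, and linear f are illustrative
    assumptions, not the paper's documented data generation.
    """
    xs = [rng.gauss(0.0, 1.0) for _ in range(k)]
    us = [rng.gauss(0.0, noise_std) for _ in range(k)]
    ys = [theta * x + u for x, u in zip(xs, us)]

    z = rng.randrange(k)          # index of the factual observation
    x_cf = rng.gauss(0.0, 1.0)    # counterfactual query

    prompt = []
    for x, y in zip(xs, ys):
        prompt.extend([x, y])
    prompt.extend([z, x_cf])

    y_cf = counterfactual_target(theta, xs[z], ys[z], x_cf)
    return prompt, y_cf


if __name__ == "__main__":
    rng = random.Random(0)
    prompt, y_cf = sample_prompt(k=5, theta=2.0, rng=rng)
    print("prompt length:", len(prompt))
    print("counterfactual target:", y_cf)
```

In this sketch the target depends on the noise of one specific in-context example, so a model must both locate the pair indexed by z and separate its noise from the shared function. That is the step the paper's summary attributes to a noise abduction head.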