Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Abstract Counterfactuals for Language Model Agents

Authors: Edoardo Pona, Milad Kazemi, Yali Du, David Watson, Nicola Paoletti

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that the approach produces consistent and meaningful counterfactuals while minimising the undesired side effects of token-level methods. We conduct experiments on text-based games and counterfactual text generation, while considering both token-level and latent-space interventions.
Researcher Affiliation | Academia | Edoardo Pona (King's College in London, EMAIL); Milad Kazemi (King's College in London, EMAIL); Yali Du (King's College in London, EMAIL); David Watson (King's College in London, EMAIL); Nicola Paoletti (King's College in London, EMAIL)
Pseudocode | No | 3.1 Inference method. Given a factual state $s$, let $a \sim A \mid s$ be the observed action (from now, we omit the time-step indices for brevity). The goal of ACF is to compute a counterfactual action $A'$ in a different state $s'$ given $a$, but without performing abduction on the token-level mechanism $f_A$ (which is semantics-agnostic). To do so, ACF derives a counterfactual $Y'$ for the observed abstraction value $y$ by performing abduction over the combined mechanism $f_Y \circ f_A$. Note that ACF's abduction step is conditioned only on the abstraction $y$, not the action $a$. The $Y'$ obtained in this way represents the abstraction of the, yet unknown, counterfactual action $A'$. The latter is found by mapping $Y'$ back into the action space, i.e., by deriving the posterior distribution of $A$ given $Y'$ and $s'$. In summary, ACF's inference procedure consists of the following three steps: 1. Abduction: derive the posterior distribution of the exogenous noise for $Y$, $U'_Y = U_Y \mid s, y$, given the observation $s, y$. Formally, this is given by... 2. Counterfactual inference of $Y$: For a given counterfactual state $s'$, we plug in the above posterior $U'_Y$ to obtain a distribution of the counterfactual abstraction $Y' = Y \mid s', U'_Y$ as follows:... 3. Mapping $Y'$ back into the action space: in the final step, we derive the counterfactual action $A'$ in a way that its distribution is consistent with the distribution of the counterfactual abstraction $Y'$ derived in step 2. First, we compute the posterior...
Open Source Code | Yes | Code: MIT; data: research-only. We provide a .zip file as part of the submission with the full codebase required for running and evaluating the results, as well as documentation on how to use it.
Open Datasets | Yes | We evaluate our approach on three benchmarks: MACHIAVELLI [23], a choice-based game for evaluating agents' social decision making, and two open-text tasks, involving the generation of short biographies [8] and Reddit comments [9], respectively.
Dataset Splits | No | The paper mentions evaluating on a 'random sample of 250 biographies' but does not specify train/test/validation splits for the datasets used in its experiments. It mentions fine-tuning classifiers on the Bios and GoEmotions datasets but not the specific splits used.
Hardware Specification | Yes | We run all the reported experiments on a server equipped with an x86_64, 128-core CPU with 405.2 GB of RAM and an NVIDIA A40 GPU with 48 GB of VRAM. The server runs Ubuntu 20.04.6 LTS.
Software Dependencies | Yes | Our agent is implemented by the OLMo-1B LLM [12]. We evaluate our method on the GPT2-XL [25] and Llama3.2-1B [11] LLMs. ... implemented by fine-tuning a DistilBERT [30] language model. As our embedding model λ, we use the all-mpnet-base-v2 model from the sentence-transformers library [28]. For the experiments throughout the paper we use gpt-4o-mini as language model.
Experiment Setup | Yes | In this setting, an intervention consists in replacing the factual state $s$ with $s' = (x, \theta')$, where the model parameters have been modified according to the MiMiC [31] transformation and the prompt has been left unchanged. We evaluate our method on a random sample of 250 biographies, and observe in Table 1 that ACF exhibits much higher abstraction value consistency from factual to counterfactual settings, both with supervised and unsupervised abstractions, compared to the token-level alternative. Our agent is implemented by the OLMo-1B LLM [12]. We run our method both with an abstraction learned in an unsupervised manner (described in appendix F) as well as a supervised one. For the gender steering latent space interventions we fine-tune the model to predict the protagonist's profession, using the Bios Bias [8] dataset, resulting in a model with an F1 score of 0.85.
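
The three-step inference procedure quoted in the Pseudocode row (abduction, counterfactual inference of the abstraction, mapping back to actions) can be illustrated on a toy discrete problem. The sketch below is not the authors' implementation: the excerpt elides the formal definitions, so it assumes a Gumbel-max structural model for the abstraction, a finite set of candidate actions, and a known table P(y | a). All function names (`abduct_gumbel_noise`, `counterfactual_abstraction`, `map_back_to_actions`, `acf_counterfactual`) and all numbers are hypothetical.

```python
import numpy as np


def abduct_gumbel_noise(log_p_y, y_obs, n_samples, rng):
    """Step 1 (assumed Gumbel-max model): sample noise u with argmax(log_p_y + u) == y_obs."""
    k = log_p_y.shape[0]
    # The maximum of the perturbed logits follows Gumbel(logsumexp(log_p_y)).
    top = rng.gumbel(loc=np.logaddexp.reduce(log_p_y), size=(n_samples, 1))
    # Every other entry is a Gumbel(log_p_y[i]) draw truncated to lie below that maximum.
    free = rng.gumbel(size=(n_samples, k)) + log_p_y
    perturbed = -np.logaddexp(-top, -free)
    perturbed[:, y_obs] = top[:, 0]
    return perturbed - log_p_y


def counterfactual_abstraction(log_p_y_cf, noise):
    """Step 2: reuse the abducted noise in the counterfactual state to get P(Y')."""
    y_cf = np.argmax(log_p_y_cf + noise, axis=1)
    return np.bincount(y_cf, minlength=log_p_y_cf.shape[0]) / noise.shape[0]


def map_back_to_actions(p_y_cf, p_action_cf, p_y_given_a):
    """Step 3: mix the posteriors P(a | y', s') with weights P(Y' = y')."""
    joint = p_action_cf[:, None] * p_y_given_a          # P(a, y | s')
    marg = joint.sum(axis=0, keepdims=True)             # P(y | s')
    posterior = np.divide(joint, marg, out=np.zeros_like(joint), where=marg > 0)
    return posterior @ p_y_cf


def acf_counterfactual(p_action, p_action_cf, p_y_given_a, y_obs, n_samples=5000, seed=0):
    """Run the three steps; returns (P(Y'), P(A')) as arrays."""
    rng = np.random.default_rng(seed)
    # Abstraction marginals: P(y | s) = sum_a P(y | a) P(a | s); clipped to avoid log(0).
    log_p_y = np.log(np.clip(p_action @ p_y_given_a, 1e-12, None))
    log_p_y_cf = np.log(np.clip(p_action_cf @ p_y_given_a, 1e-12, None))
    noise = abduct_gumbel_noise(log_p_y, y_obs, n_samples, rng)
    p_y_cf = counterfactual_abstraction(log_p_y_cf, noise)
    return p_y_cf, map_back_to_actions(p_y_cf, p_action_cf, p_y_given_a)


if __name__ == "__main__":
    # Hypothetical toy data: 4 candidate actions, 2 abstraction classes.
    p_y_given_a = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
    p_factual = np.array([0.4, 0.3, 0.2, 0.1])          # P(a | s)
    p_counterfactual = np.array([0.1, 0.2, 0.3, 0.4])   # P(a | s')
    p_y_cf, p_a_cf = acf_counterfactual(p_factual, p_counterfactual, p_y_given_a, y_obs=0)
    print("P(Y'):", p_y_cf)
    print("P(A'):", p_a_cf)
```

Note that the abduction in this sketch conditions only on the observed abstraction value `y_obs`, never on the observed action, which mirrors the remark in the excerpt. When the counterfactual state equals the factual one, the abducted noise reproduces the observed abstraction exactly, so the result reduces to the posterior over actions given that abstraction.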