Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Estimating Interventional Distributions with Uncertain Causal Graphs through Meta-Learning

Authors: Anish Dhir, Cristiana Diaconu, Valentinian Lungu, James Requeima, Richard E Turner, Mark van der Wilk

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, we demonstrate that MACE-TNP outperforms strong Bayesian baselines. Our work establishes meta-learning as a flexible and scalable paradigm for approximating complex Bayesian causal inference that can be scaled to increasingly challenging settings in the future. ... We evaluate the performance of our model, MACE-TNP, against Bayesian causal inference baselines, and a causal discovery method that selects a single graph. With our experiments we aim to answer: 1) When analytically tractable, can we confirm that our model recovers the true posterior interventional distribution under identifiability and non-identifiability of the causal graph, 2) How does our model compare against baselines when the baselines' assumptions are respected and when they are violated, 3) How does our model perform when the number of nodes is scaled, 4) How does our model perform when we do not have knowledge of the data generating process? ... Finally, we apply our proposed method on the Sachs proteomics dataset [55], which includes measurements of D = 11 proteins from thousands of cells under various molecular interventions.
Researcher Affiliation	Academia	Anish Dhir (Imperial College London); Cristiana Diaconu (University of Cambridge); Valentinian Mihai Lungu (University of Cambridge); James Requeima (University of Toronto, Vector Institute); Richard E. Turner (University of Cambridge, Alan Turing Institute); Mark van der Wilk (University of Oxford)
Pseudocode	No	The paper describes the architecture and procedures in detail using prose, equations, and diagrams (Figure 2), but no explicitly labeled 'Pseudocode' or 'Algorithm' block is present.
Open Source Code	Yes	Code for our experiments is available at: https://github.com/Anish144/Causal Inference Neural Process.
Open Datasets	Yes	Finally, we apply our proposed method on the Sachs proteomics dataset [55], which includes measurements of D = 11 proteins from thousands of cells under various molecular interventions.
Dataset Splits	Yes	To train MACE-TNP, we randomise the number of observational samples N_obs ∼ U{50, 750}, and set N_int = 1000 − N_obs. The training loss is evaluated on these N_int samples. For testing, we sample 500 observation points and compute the loss against 500 intervention points. ... We train the model for 1 epoch on 50,000 datasets and test on 100 datasets. ... We train the model for 2 epochs on 50,000 datasets for the GP experiment and 100,000 datasets for the NN one, and test on 100 datasets in both cases.
Hardware Specification	Yes	For the two- and three-node experiments, we ran both training and inference on a single NVIDIA GeForce RTX 2080 Ti (11 GB) with 20 CPU cores on a shared cluster. The only exception was for our largest three-node GP and NN models (with d_model = 1024), where we used a single NVIDIA RTX 6000 Ada Generation (50 GB) paired with 56 CPU cores; those models required roughly 25 GB of GPU memory. For the higher-node experiments, we used a single NVIDIA A100 80GB GPU, as well as an RTX 4090 24GB GPU.
Software Dependencies	No	The paper describes the model architecture and training details, but does not explicitly list specific software dependencies with their version numbers in the main text or appendices.
Experiment Setup	Yes	Throughout our experiments we use H = 8 attention heads, each of dimension D_Q = D_KV = d_model/8. The MLPs used in the encoding use two layers and a hidden dimension of d_embed = d_model. Unless otherwise specified, we use a learning rate of 5 × 10^-4 with a linear warmup of 2% of the total iterations, and a batch size of 32.
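
The Experiment Setup row can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the authors' repository. It reads the learning rate as 5e-4, computes the per-head dimension d_model/8, and implements a linear warmup over the first 2% of iterations. The function names, the example step count of 10,000, and the choice to hold the rate constant after warmup are assumptions; the quoted text does not say what happens after warmup.

```python
# Illustrative sketch of the quoted hyperparameters; names are hypothetical.

NUM_HEADS = 8            # H = 8 attention heads (from the quoted setup)
BASE_LR = 5e-4           # learning rate, read as 5 x 10^-4
WARMUP_FRACTION = 0.02   # linear warmup over 2% of total iterations
BATCH_SIZE = 32          # batch size (from the quoted setup)


def head_dim(d_model: int, num_heads: int = NUM_HEADS) -> int:
    """Per-head dimension D_Q = D_KV = d_model / num_heads."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    return d_model // num_heads


def learning_rate(step: int, total_steps: int) -> float:
    """Linear warmup to BASE_LR, then constant.

    Holding the rate constant after warmup is an assumption; the quoted
    setup only specifies the warmup phase.
    """
    warmup_steps = max(1, round(WARMUP_FRACTION * total_steps))
    if step < warmup_steps:
        # Ramp linearly so the final warmup step reaches BASE_LR exactly.
        return BASE_LR * (step + 1) / warmup_steps
    return BASE_LR


if __name__ == "__main__":
    total_steps = 10_000  # hypothetical run length, not from the paper
    print("head dim for d_model=1024:", head_dim(1024))
    for step in (0, 99, 199, 5_000):
        print(step, learning_rate(step, total_steps))
```

With the hypothetical 10,000 iterations, warmup covers the first 200 steps, and d_model = 1024 (the largest width mentioned in the Hardware Specification row) gives heads of dimension 128.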