Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Prediction-Powered Causal Inferences

Authors: Riccardo Cadei, Ilker Demirel, Piersilvio De Bartolomeis, Lukas Lindorfer, Sylvia Cremer, Cordelia Schmid, Francesco Locatello

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our method on synthetic and real-world scientific data, solving impossible problem instances for Empirical Risk Minimization even with standard invariance constraints. In particular, for the first time, we achieve valid causal inference on a scientific experiment with complex recording and no human annotations, fine-tuning a foundational model on our similar annotated experiment. [...] To test it, we considered ISTAnt experiment [Cadei et al., 2024] (unique real-world benchmark for treatment effect estimation with complex measurements), ignoring the outcome annotations, and trained the predictive model over a new annotated experiment of ours with the same annotation mechanism, but different recording platform (lower quality) and treatments. We further validate and confirm the results on a synthetic manipulation of MNIST dataset [LeCun, 1998] by controlling the data-generating process and causal effect.
Researcher Affiliation | Academia | (1) Institute of Science and Technology Austria (ISTA); (2) Massachusetts Institute of Technology (MIT); (3) Department of Computer Science, ETH Zurich; (4) INRIA, Ecole Normale Supérieure, CNRS, PSL Research University. Equal contribution.
Pseudocode | No | The paper describes methods textually and mathematically but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We additionally share the Python implementation of all the experiments in the Supplementary Materials. [...] We share all the code implementation in the supplementary material.
Open Datasets | Yes | Our new experimental ecology dataset preview is anonymously shared on Figshare at https://figshare.com/s/9a490b6f6eeebd73350b. We further rely on ISTAnt dataset publicly available at https://doi.org/10.6084/m9.figshare.26484934.v2. The synthetic experiments on Causal MNIST relies on MNIST dataset LeCun [1998], publicly available.
Dataset Splits | Yes | We sampled 10 000 observations from PA to train a digits classifier (a Convolutional Neural Network) and tested it in PPCI in-distribution (10 000 more sample from PA) and out-of-distribution (zero-shot) on 10 000 observations for each PB, PC, PD, PE. [...] selecting the best-performing hyper-parameters for each model-method, minimizing the Treatment Effect Bias on the training sample, while guaranteeing good predictive performances, i.e., accuracy greater than 0.8, on a small validation set (1 000 random frames).
Hardware Specification | Yes | We run all the analyses using 48GB of RAM, 20 CPU cores, and a single node GPU (NVIDIA GeForce RTX 2080 Ti). [...] We run all the analysis using 10GB of RAM, 8 CPU cores, and a single node GPU (NVIDIA GeForce RTX 2080 Ti).
Software Dependencies | No | The paper mentions software like 'Adam optimizer', 'XGBoost', 'AIPW', 'X-Learner', 'BART', and 'Causal Forest' but does not specify their version numbers.
Experiment Setup | Yes | For each pre-trained encoder, we fine-tuned a multi-layer perceptron head (2 hidden layers with 256 nodes each and ReLU activation) on top of its class token via Adam optimizer (β1 = 0.9, β2 = 0.9, ϵ = 10^-8) for ERM, vREx (finetuning the invariance constraint in {0.01, 0.1, 1, 10}) and DERM (ours) for 15 epochs and batch size 256. So, we fine-tuned the learning rates in [0.0005, 0.5] [...] Table 4: Training details for the Convolutional Neural Network training on Causal MNIST. Hyper-parameters: Loss = Cross Entropy; Learning Rate = 0.0001; Optimizer = Adam (β1 = 0.9, β2 = 0.9, ϵ = 10^-8); Batch Size = 32; Epochs = 40.
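
The selection rule quoted in the Dataset Splits row (minimize the Treatment Effect Bias on the training sample, subject to validation accuracy greater than 0.8) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Candidate` class, the `select_candidate` function, and all numbers below are hypothetical placeholders. The sketch also assumes that "minimizing" the bias means minimizing its absolute value.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """One hyper-parameter configuration with hypothetical metrics."""
    name: str
    train_te_bias: float  # Treatment Effect Bias on the training sample
    val_accuracy: float   # accuracy on the small validation set


def select_candidate(candidates: Iterable[Candidate],
                     min_accuracy: float = 0.8) -> Optional[Candidate]:
    """Return the candidate with the smallest |bias| among those whose
    validation accuracy exceeds min_accuracy, or None if none qualifies."""
    eligible = [c for c in candidates if c.val_accuracy > min_accuracy]
    if not eligible:
        return None
    # Assumption: "minimizing" the bias means minimizing its absolute value.
    return min(eligible, key=lambda c: abs(c.train_te_bias))


# Placeholder numbers for illustration only; they are NOT results from the paper.
demo = [
    Candidate("lr=0.0005", train_te_bias=0.30, val_accuracy=0.91),
    Candidate("lr=0.005", train_te_bias=-0.05, val_accuracy=0.85),
    Candidate("lr=0.05", train_te_bias=0.01, val_accuracy=0.62),
]
best = select_candidate(demo)
print(best.name if best else "no candidate meets the accuracy threshold")
```

In this toy input, the configuration with the smallest bias is excluded because its validation accuracy falls below the threshold, which shows why the accuracy constraint is applied before the bias is minimized.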