Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

LLM-Driven Treatment Effect Estimation Under Inference Time Text Confounding

Authors: Yuchen Ma, Dennis Frauen, Jonas Schweisthal, Stefan Feuerriegel

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	6 Experiment. Datasets: We use the following datasets from medical practice for benchmarking: (i) The International Stroke Trial (IST) [53] is one of the largest randomized controlled trials in acute stroke treatment. The dataset comprises 19,435 patients. (ii) MIMIC-III [30] is a large, single-center database comprising information relating to patients admitted to critical care units at a large tertiary care hospital. MIMIC-III contains 38,597 distinct adult patients. We adhere to the terms and conditions governing the use of the MIMIC dataset (in particular, our analysis is HIPAA compliant). Details are in Appendix E. Due to the fundamental problem of causal inference, the counterfactual outcomes are never observed in real-world data. We thus follow prior literature (e.g., [6, 7, 33, 39, 54]) and benchmark our model using semi-synthetic datasets. Details of datasets are in Appendix D.
Researcher Affiliation	Academia	Yuchen Ma, Dennis Frauen, Jonas Schweisthal & Stefan Feuerriegel Munich Center for Machine Learning LMU Munich EMAIL
Pseudocode	Yes	Algorithm 1: TCA for CATE estimation with inference time text confounding.
Open Source Code	Yes	Code is available at https://github.com/yccm/llm-tca
Open Datasets	Yes	Datasets: We use the following datasets from medical practice for benchmarking: (i) The International Stroke Trial (IST) [53] is one of the largest randomized controlled trials in acute stroke treatment. The dataset comprises 19,435 patients. (ii) MIMIC-III [30] is a large, single-center database comprising information relating to patients admitted to critical care units at a large tertiary care hospital. MIMIC-III contains 38,597 distinct adult patients.
Dataset Splits	No	The paper mentions "training data $D_X = (x_i, a_i, y_i)_{i=1}^{n}$" and "test data $D_T = (t_j, a_j, y_j)_{j=1}^{m}$", which implies a split, but it does not specify the method or the percentages used to divide the data into training and test sets.
Hardware Specification	Yes	Experiments were carried out on 2 GPUs (NVIDIA A100-PCIE-40GB) with Intel Xeon Silver 4316 CPUs.
Software Dependencies	No	Text generation: Given structured clinical confounders X, we generate the induced text confounder T using GPT-4o mini through the OpenAI API. Text embedding: We convert generated text T into dense vectors using pretrained BERT (bert-base-uncased) from Hugging Face Transformers. The paper mentions specific LLM models and a BERT model but does not provide specific version numbers for software libraries like PyTorch, TensorFlow, or the Hugging Face Transformers library itself.
Experiment Setup	Yes	Model architecture and training: Propensity scores $\hat{\pi}_x(x)$ are estimated via logistic regression. The CATE predictor is a 3-layer MLP with ReLU activation, batch normalization, and 0.3 dropout. We use Adam optimizer with learning rate 5e-5. Models are trained for 100 epochs on IST and 150 on MIMIC-III with a batch size of 512. We apply label smoothing α = 0.1 and gradient clipping (max norm = 1.0).
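
The Experiment Setup row can be illustrated with a minimal sketch of the reported predictor configuration. It is not the authors' implementation (that is in the linked repository), and several details are assumptions: PyTorch as the framework (the paper does not name one), the hidden width of 128, the input size of 768 (the embedding size of bert-base-uncased), the learning rate read as 5e-5, random placeholder data, and a plain squared-error loss standing in for the paper's training objective. Label smoothing is omitted because the row does not say how it is applied.

```python
import torch
from torch import nn


def build_cate_mlp(in_dim: int = 768, hidden_dim: int = 128) -> nn.Sequential:
    """3-layer MLP with ReLU, batch normalization, and 0.3 dropout.

    in_dim and hidden_dim are assumptions, not values reported in the row.
    """
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(hidden_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(hidden_dim, 1),
    )


def train_step(model, optimizer, features, targets) -> float:
    """One optimization step with gradient clipping at max norm 1.0."""
    model.train()
    optimizer.zero_grad()
    preds = model(features).squeeze(-1)
    # Placeholder loss (assumption): the actual objective is defined in the paper.
    loss = nn.functional.mse_loss(preds, targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = build_cate_mlp()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

# One random batch of the reported size (512); real inputs would be text embeddings.
features = torch.randn(512, 768)
targets = torch.randn(512)
loss_value = train_step(model, optimizer, features, targets)
```

A full run would repeat `train_step` over mini-batches for the reported number of epochs (100 on IST, 150 on MIMIC-III).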