Task-specific experimental design for treatment effect estimation

Authors: Bethany Connolly, Kim Moore, Tobias Schwedes, Alexander Adam, Gary Willis, Ilya Feige, Christopher Frye

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Across a range of important tasks, real-world datasets, and sample sizes, our method outperforms other benchmarks, e.g. requiring an order-of-magnitude less data to match RCT performance on targeted marketing tasks.
Researcher Affiliation | Industry | Faculty, 160 Old Street, London, UK. Correspondence to: Christopher Frye <chris.f@faculty.ai>.
Pseudocode | Yes | Our full method, including discretisation (see Sec. 3.3), is detailed in Algorithm 1 and summarised in Fig. 2.
Open Source Code | No | The paper provides a tinyurl for processed data (https://tinyurl.com/RetailHero) and references open-source code for benchmark methods (https://github.com/raddanki/Sample-Constrained-Treatment-Effect-Estimation), but does not explicitly state that its *own* methodology's source code is publicly available.
Open Datasets | Yes | The datasets we use for our experiments are described at length in App. B.1. In brief, we test our method on:
- STROKE: clinical trial evaluating aspirin's effect on stroke patients; our sub-selection procedure results in a dataset of size 9k (Sandercock et al., 2011).
- CRITEOVISIT & CRITEOCONVERSION: marketing trial evaluating effectiveness of email campaign on two different outcomes; we sub-select 7M rows of data (Diemert et al., 2018).
- RETAILHERO: marketing trial in which we engineered features from purchase history data for 200k individuals (see App. B.1 for references).
Processed data: https://tinyurl.com/RetailHero
Dataset Splits | Yes | Additionally during model training, the sampled data was partitioned 80/20 into training/validation sets for early stopping (with early-stopping-rounds: 50). ... For all datasets except STROKE, we performed 384 trials per experiment, and we bootstrap-resampled the test set for each trial. Because of its smaller size, experiments on STROKE each consisted of 1000 trials, and we performed a fresh train-test split for each trial.
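As a minimal sketch of this trial protocol (the stand-in data, model, and metric below are hypothetical placeholders, not the authors' code):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)

def run_trial(X_sampled, y_sampled, X_test, y_test):
    """One trial: 80/20 train/validation partition for early stopping,
    then evaluation on a bootstrap-resampled test set (sketch only)."""
    X_tr, X_val, y_tr, y_val = train_test_split(X_sampled, y_sampled, test_size=0.2)
    model = XGBClassifier(early_stopping_rounds=50)  # early-stopping-rounds: 50
    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
    boot = rng.integers(0, len(X_test), size=len(X_test))  # bootstrap resample
    return model.score(X_test[boot], y_test[boot])  # placeholder metric

# Hypothetical stand-in data, for illustration only
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)
X_sampled, X_test, y_sampled, y_test = X[:800], X[800:], y[:800], y[800:]

# 384 trials per experiment (1000 for STROKE, with a fresh train-test split each)
scores = [run_trial(X_sampled, y_sampled, X_test, y_test) for _ in range(384)]
```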
Hardware Specification | Yes | All experiments were performed in parallel on 96-core, 393 GB machines.
Software Dependencies | No | The paper mentions software components like Adam (Kingma & Ba, 2015) and XGBoost, but does not specify their version numbers or other crucial software dependencies required for replication.
Experiment Setup | Yes | The VAE architecture we used in our experiments is comprised of a 2-layer fully-connected encoder with 100-dimensional hidden layers, a 2-dimensional latent space... We trained the VAE using Adam (Kingma & Ba, 2015) with learning rate 10^-4 and early stopping on the validation-set ELBO. ... We discretised the continuous latent representation... slicing each edge... into a number of cells (with a default of 20 unless stated otherwise). ... the core learners of the ITE estimator were XGBoost models initialised with the following hyperparameters:
- n-estimators: 400
- objective: binary:logistic
- eval-metric: rmse
- max-depth: 1 (T-learner), 2 (S-learner)
Additionally during model training, the sampled data was partitioned 80/20 into training/validation sets for early stopping (with early-stopping-rounds: 50).
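For concreteness, the quoted learner configuration and the latent-grid discretisation could be instantiated as in the sketch below. This is our own illustration under assumptions (equal-width bins per latent dimension, hypothetical helper names), not the authors' released code:

```python
import numpy as np
from xgboost import XGBClassifier

def make_core_learner(max_depth):
    """XGBoost core learner with the hyperparameters quoted above."""
    return XGBClassifier(
        n_estimators=400,
        objective="binary:logistic",
        eval_metric="rmse",
        max_depth=max_depth,       # 1 for the T-learner, 2 for the S-learner
        early_stopping_rounds=50,  # early-stopping-rounds: 50
    )

t_learner_base = make_core_learner(max_depth=1)
s_learner_base = make_core_learner(max_depth=2)

def discretise_latents(z, n_cells=20):
    """Slice each edge of the latent space into n_cells cells and map each
    point to an integer cell index (assumes equal-width bins; the paper's
    exact procedure is given in its Algorithm 1)."""
    lo, hi = z.min(axis=0), z.max(axis=0)
    cells = np.zeros(len(z), dtype=int)
    for d in range(z.shape[1]):
        edges = np.linspace(lo[d], hi[d], n_cells + 1)[1:-1]  # interior edges
        cells = cells * n_cells + np.digitize(z[:, d], edges)
    return cells
```

With the paper's 2-dimensional latent space and the default of 20 cells per edge, this yields a 20 × 20 = 400-cell grid.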