Task-specific experimental design for treatment effect estimation

Authors: Bethany Connolly, Kim Moore, Tobias Schwedes, Alexander Adam, Gary Willis, Ilya Feige, Christopher Frye

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Across a range of important tasks, real-world datasets, and sample sizes, our method outperforms other benchmarks, e.g. requiring an order-of-magnitude less data to match RCT performance on targeted marketing tasks.
Researcher Affiliation | Industry | Faculty, 160 Old Street, London, UK. Correspondence to: Christopher Frye <chris.f@faculty.ai>.
Pseudocode | Yes | Our full method, including discretisation (see Sec. 3.3), is detailed in Algorithm 1 and summarised in Fig. 2.
Open Source Code | No | The paper provides a tinyurl for processed data (https://tinyurl.com/RetailHero) and references open-source code for benchmark methods (https://github.com/raddanki/Sample-Constrained-Treatment-Effect-Estimation), but does not explicitly state that its *own* methodology's source code is publicly available.
Open Datasets | Yes | The datasets we use for our experiments are described at length in App. B.1. In brief, we test our method on:
- STROKE: clinical trial evaluating aspirin's effect on stroke patients; our sub-selection procedure results in a dataset of size 9k (Sandercock et al., 2011).
- CRITEOVISIT & CRITEOCONVERSION: marketing trial evaluating effectiveness of email campaign on two different outcomes; we sub-select 7M rows of data (Diemert et al., 2018).
- RETAILHERO: marketing trial in which we engineered features from purchase history data for 200k individuals (see App. B.1 for references).
Processed data: https://tinyurl.com/RetailHero
Dataset Splits | Yes | Additionally during model training, the sampled data was partitioned 80/20 into training/validation sets for early stopping (with early-stopping-rounds: 50). ... For all datasets except STROKE, we performed 384 trials per experiment, and we bootstrap-resampled the test set for each trial. Because of its smaller size, experiments on STROKE each consisted of 1000 trials, and we performed a fresh train-test split for each trial.
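As a minimal sketch of this trial protocol (the stand-in data, model, and metric below are hypothetical placeholders, not the authors' code):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)

def run_trial(X_sampled, y_sampled, X_test, y_test):
    """One trial: 80/20 train/validation partition for early stopping,
    then evaluation on a bootstrap-resampled test set (sketch only)."""
    X_tr, X_val, y_tr, y_val = train_test_split(X_sampled, y_sampled, test_size=0.2)
    model = XGBClassifier(early_stopping_rounds=50)  # early-stopping-rounds: 50
    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
    boot = rng.integers(0, len(X_test), size=len(X_test))  # bootstrap resample
    return model.score(X_test[boot], y_test[boot])  # placeholder metric

# Hypothetical stand-in data, for illustration only
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)
X_sampled, X_test, y_sampled, y_test = X[:800], X[800:], y[:800], y[800:]

# 384 trials per experiment (1000 for STROKE, with a fresh train-test split each)
scores = [run_trial(X_sampled, y_sampled, X_test, y_test) for _ in range(384)]
```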
Hardware Specification | Yes | All experiments were performed in parallel on 96-core, 393 GB machines.
Software Dependencies | No | The paper mentions software components like Adam (Kingma & Ba, 2015) and XGBoost, but does not specify their version numbers or other crucial software dependencies required for replication.
Experiment Setup | Yes | The VAE architecture we used in our experiments is comprised of a 2-layer fully-connected encoder with 100-dimensional hidden layers, a 2-dimensional latent space... We trained the VAE using Adam (Kingma & Ba, 2015) with learning rate 10^-4 and early stopping on the validation-set ELBO. ... We discretised the continuous latent representation... slicing each edge... into a number of cells (with a default of 20 unless stated otherwise). ... the core learners of the ITE estimator were XGBoost models initialised with the following hyperparameters:
- n-estimators: 400
- objective: binary:logistic
- eval-metric: rmse
- max-depth: 1 (T-learner), 2 (S-learner)
Additionally during model training, the sampled data was partitioned 80/20 into training/validation sets for early stopping (with early-stopping-rounds: 50).
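For concreteness, the quoted learner configuration and the latent-grid discretisation could be instantiated as in the sketch below. This is our own illustration under assumptions (equal-width bins per latent dimension, hypothetical helper names), not the authors' released code:

```python
import numpy as np
from xgboost import XGBClassifier

def make_core_learner(max_depth):
    """XGBoost core learner with the hyperparameters quoted above."""
    return XGBClassifier(
        n_estimators=400,
        objective="binary:logistic",
        eval_metric="rmse",
        max_depth=max_depth,       # 1 for the T-learner, 2 for the S-learner
        early_stopping_rounds=50,  # early-stopping-rounds: 50
    )

t_learner_base = make_core_learner(max_depth=1)
s_learner_base = make_core_learner(max_depth=2)

def discretise_latents(z, n_cells=20):
    """Slice each edge of the latent space into n_cells cells and map each
    point to an integer cell index (assumes equal-width bins; the paper's
    exact procedure is given in its Algorithm 1)."""
    lo, hi = z.min(axis=0), z.max(axis=0)
    cells = np.zeros(len(z), dtype=int)
    for d in range(z.shape[1]):
        edges = np.linspace(lo[d], hi[d], n_cells + 1)[1:-1]  # interior edges
        cells = cells * n_cells + np.digitize(z[:, d], edges)
    return cells
```

With the paper's 2-dimensional latent space and the default of 20 cells per edge, this yields a 20 × 20 = 400-cell grid.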