Identifying Causal-Effect Inference Failure with Uncertainty-Aware Models

Authors: Andrew Jesson, Sören Mindermann, Uri Shalit, Yarin Gal

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we show empirical evidence for the following claims: that our uncertainty-aware methods are robust both to violations of the overlap assumption and a failure mode of propensity-based trimming (6.1); that they indicate high uncertainty when covariate shifts occur between training and test distributions (6.2); and that they yield lower CATE estimation errors while rejecting fewer points than propensity-based trimming (6.2).
Researcher Affiliation | Academia | Andrew Jesson (Department of Computer Science, University of Oxford, Oxford, UK OX1 3QD, andrew.jesson@cs.ox.ac.uk); Sören Mindermann (Department of Computer Science, University of Oxford, Oxford, UK OX1 3QD, soren.mindermann@cs.ox.ac.uk); Uri Shalit (Technion, Haifa, Israel 3200003, urishalit@technion.ac.il); Yarin Gal (Department of Computer Science, University of Oxford, Oxford, UK OX1 3QD, yarin.gal@cs.ox.ac.uk)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The preceding results can be reproduced using publicly available code (footnote 3: available at https://github.com/OATML/ucate).
Open Datasets | Yes | We introduce a new, high-dimensional, individual-level causal effect prediction benchmark dataset called CEMNIST to demonstrate robustness to overlap and propensity failure (6.1). Finally, we introduce a modification to the IHDP causal inference benchmark to explore covariate shift (6.2). We report results for the unaltered IHDP dataset in figure 3a and the l.h.s. of table 2. This supports that uncertainty rejection is more data-efficient, i.e., errors are lower while rejecting less. This is further supported by the results on ACIC 2016 [11] (figure 3c and the r.h.s. of table 2). (An illustrative error-vs-rejection sketch follows the table.)
Dataset Splits | Yes | We generated 1000 datasets following the original IHDP experimental setup, each with 747 data points (561 treated and 186 control patients) randomly split into training (600) and test (147) sets. (A minimal split-generation sketch follows the table.)
Hardware Specification | No | The paper does not specify any hardware details such as CPU/GPU models, memory, or specific computing environments used for experiments.
Software Dependencies | No | The paper mentions general software like 'TensorFlow' (in references) and 'Python' (for data generation) but does not provide specific version numbers for key software components or libraries used in their experimental setup.
Experiment Setup | Yes | MC Dropout is a simple change to existing methods. Gal & Ghahramani [15] showed that we can simply add dropout [52] with L2 regularization in each of ω0, ω1 during training and then sample from the same dropout distribution at test time to get samples from q(ω0, ω1|D). With tuning of the dropout probability, this is equivalent to sampling from a Bernoulli approximate posterior q(ω0, ω1|D) (with standard Gaussian prior). MC Dropout has been used in various applications [60, 38, 28]. ... Appendix B.1 describes all models. All models were trained for 100 epochs using the Adam optimizer with default parameters. (A hedged MC Dropout sketch follows the table.)
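
The sketches below are illustrative, not the authors' released code; the repository linked above is the authoritative source. First, a minimal sketch of the 600/147 random split described in the "Dataset Splits" row; the seed and helper name are assumptions.

```python
# Hedged sketch: one random 600/147 split of a 747-point IHDP realization.
# The seed and function name are illustrative assumptions, not from the paper.
import numpy as np

def ihdp_split(n_points=747, n_train=600, seed=0):
    """Return train/test index arrays for one IHDP realization."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_points)
    return idx[:n_train], idx[n_train:]  # 600 training indices, 147 test indices

train_idx, test_idx = ihdp_split()
assert len(train_idx) == 600 and len(test_idx) == 147
```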
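
Next, a hedged Keras sketch of the MC Dropout recipe quoted in the "Experiment Setup" row: dropout plus L2 regularization during training, with dropout kept active at test time to draw approximate posterior samples of the CATE. The two-headed architecture, layer sizes, dropout rate, and weight decay are assumptions for illustration; Appendix B.1 of the paper specifies the actual models.

```python
# Hedged MC Dropout sketch: dropout + L2 during training, and the same dropout
# distribution sampled at test time (training=True) to approximate draws from
# q(omega_0, omega_1 | D). Architecture and hyperparameters are illustrative.
import numpy as np
import tensorflow as tf

def build_two_headed_model(d_in, p_drop=0.1, weight_decay=1e-4):
    reg = tf.keras.regularizers.l2(weight_decay)
    inputs = tf.keras.Input(shape=(d_in,))
    h = inputs
    for _ in range(2):  # shared representation layers
        h = tf.keras.layers.Dense(200, activation="elu", kernel_regularizer=reg)(h)
        h = tf.keras.layers.Dropout(p_drop)(h)

    def head(z):  # one potential-outcome head (omega_0 or omega_1)
        for _ in range(2):
            z = tf.keras.layers.Dense(100, activation="elu", kernel_regularizer=reg)(z)
            z = tf.keras.layers.Dropout(p_drop)(z)
        return tf.keras.layers.Dense(1, kernel_regularizer=reg)(z)

    y0_hat, y1_hat = head(h), head(h)
    return tf.keras.Model(inputs, [y0_hat, y1_hat])

def mc_cate(model, x, n_samples=100):
    """Monte Carlo CATE: mean and variance of tau_hat(x) over dropout samples."""
    taus = []
    for _ in range(n_samples):
        y0, y1 = model(x, training=True)  # keep dropout active at test time
        taus.append((y1 - y0).numpy().squeeze(-1))
    taus = np.stack(taus)
    return taus.mean(axis=0), taus.var(axis=0)
```

Training is omitted here; per the quoted setup it would use the Adam optimizer with default parameters for 100 epochs, with each head presumably fit on the factual outcomes of its treatment group.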
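
Finally, an illustrative helper for the error-versus-rejection comparison summarized in the "Open Datasets" row: reject the points with the highest CATE uncertainty (e.g. the variance returned by mc_cate above) and report error on the rest. This is not the authors' evaluation code; tau_true, tau_mean, and tau_var are assumed array names.

```python
# Illustrative uncertainty-based rejection curve: drop the highest-uncertainty
# points and report root PEHE on the retained ones. Names are placeholders.
import numpy as np

def error_at_rejection(tau_true, tau_mean, tau_var, reject_frac):
    """Root PEHE on the (1 - reject_frac) fraction of points with lowest uncertainty."""
    n = len(tau_true)
    n_keep = int(round((1.0 - reject_frac) * n))
    keep = np.argsort(tau_var)[:n_keep]  # most-certain points first
    return np.sqrt(np.mean((tau_true[keep] - tau_mean[keep]) ** 2))

# Sweep rejection rates to trace an error-vs-rejection curve, e.g.:
# curve = [error_at_rejection(tau_true, tau_mean, tau_var, r)
#          for r in np.linspace(0.0, 0.5, 6)]
```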