Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Quantifying Ignorance in Individual-Level Causal-Effect Estimates under Hidden Confounding
Authors: Andrew Jesson, Sören Mindermann, Yarin Gal, Uri Shalit
ICML 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove that our estimator converges to tight bounds on CATE when there may be unobserved confounding and assess it using semi-synthetic, high-dimensional datasets. |
| Researcher Affiliation | Academia | ¹OATML, University of Oxford; ²Machine Learning and Causal Inference in Healthcare Lab, Technion Israel Institute of Technology. |
| Pseudocode | No | The paper describes the steps for computing the interval estimator in Section 3.3, but these steps are presented as descriptive text rather than a formally structured pseudocode or algorithm block. |
| Open Source Code | Yes | Code availability: The code is available at https://github.com/andrewjesson/HiddenConfoundingCATE |
| Open Datasets | Yes | For this experiment, we adopt the one-dimensional simulated setting into a high-dimensional setting C.2. Specifically, we assign to each image of the MNIST dataset (LeCun, 1998) a latent feature φ ∈ [−2, 2] as follows: all images of the digit 0 are assigned a φ ∈ [−2, −1.6], all images of the digit 1 have φ ∈ [−1.6, −1.2], and so on up to the digit 9. [...] To this end we use the IHDP dataset (Hill, 2011) as Jesson et al. (2020) show that low overlap and/or similarity are problems for IHDP. |
| Dataset Splits | Yes | The average and 95% confidence intervals over 50 random realizations of training (n = 1000), validation (n = 100), and test (n = 1000) datasets are reported. |
| Hardware Specification | No | The paper states, 'Details for each experiment, including architectures, hyper-parameter tuning, training procedures, and compute infrastructure are detailed in Appendix D.' However, Appendix D is not provided in the given text, so no specific hardware details are available in the main body. |
| Software Dependencies | No | The paper mentions software like 'Deep Ensembles' and 'Pytorch', but it does not specify any version numbers for these or other software dependencies, which is required for reproducibility. |
| Experiment Setup | No | The paper notes that 'Details for each experiment, including architectures, hyper-parameter tuning, training procedures, and compute infrastructure are detailed in Appendix D.' However, Appendix D is not provided in the given text, thus specific experimental setup details like hyperparameter values are not explicitly stated in the main body. |
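
The digit-to-φ assignment quoted in the Open Datasets row can be illustrated with a minimal sketch. It assumes the ten digits split [−2, 2] into equal-width sub-intervals of 0.4, which matches the two intervals quoted for digits 0 and 1. The function names `phi_interval` and `sample_phi` are hypothetical, and drawing φ uniformly within a sub-interval is an assumption: the quoted text does not say how φ is chosen inside each interval, and this is not taken from the authors' code.

```python
import random


def phi_interval(digit, low=-2.0, high=2.0, n_classes=10):
    """Return the (start, end) sub-interval of [low, high] for a digit label.

    Assumes equal-width sub-intervals, one per digit class.
    """
    if not 0 <= digit < n_classes:
        raise ValueError(f"digit must be in [0, {n_classes}), got {digit}")
    width = (high - low) / n_classes
    start = low + digit * width
    return start, start + width


def sample_phi(digit, rng):
    """Draw a latent feature for one image of the given digit.

    Uniform sampling within the sub-interval is an assumption made
    for illustration only.
    """
    start, end = phi_interval(digit)
    return rng.uniform(start, end)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs agree
    for digit in range(10):
        start, end = phi_interval(digit)
        print(digit, round(start, 1), round(end, 1), sample_phi(digit, rng))
```

Under these assumptions, digit 9 maps to the last sub-interval, ending at 2, which agrees with "and so on up to the digit 9" in the quoted passage.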