Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On Doubly Robust Inference for Double Machine Learning in Semiparametric Regression

Authors: Oliver Dukes, Stijn Vansteelandt, David Whitney

JMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental In order to judge how well the methods are expected to perform in practice, we conducted three simulation experiments.
Researcher Affiliation Collaboration Oliver Dukes (EMAIL), Department of Applied Mathematics, Computer Science and Statistics, Ghent University, 9000 Ghent, Belgium; Stijn Vansteelandt (EMAIL), Department of Applied Mathematics, Computer Science and Statistics, Ghent University, 9000 Ghent, Belgium; David Whitney (EMAIL), GSK, Gunnels Wood Road, Stevenage, SG1 2NY, U.K.
Pseudocode Yes We describe how this can be done below: 1. Divide the sample into disjoint parts I_k, each of size n_k = n/K, where K is a fixed integer (and assuming n is a multiple of K). For each I_k, let I_k^c denote all indices that are not in I_k. 2. Obtain the machine learning estimates ĝ_k^c(L) and m̂_k^c(L) from I_k^c. 3. Obtain the estimates Ĝ_k(L) and M̂_k(L) from I_k. 4. Obtain the estimates α̂_k and β̂_k via solving the equations: ... 5. For all i in I_k, obtain the score ... 6. Construct a test statistic ...
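The cross-fitting recipe quoted above can be sketched as follows. This is an illustrative skeleton only: the nuisance fits below are plain least-squares stand-ins, not the paper's estimators, and the estimating equations in steps 4–6 (elided with "..." in the quote) are not reproduced.

```python
import numpy as np

def cross_fit(Y, A, L, K=5):
    """Sketch of K-fold cross-fitting: nuisance models are fit on the
    complement I_k^c and evaluated on the held-out fold I_k.  The linear
    fits here are placeholders, not the paper's actual learners."""
    n = len(Y)
    assert n % K == 0, "the quoted recipe assumes n is a multiple of K"
    idx = np.arange(n)
    folds = np.array_split(idx, K)          # disjoint parts I_1, ..., I_K
    g_hat = np.empty(n)                     # out-of-fold predictions g^c_k(L)
    m_hat = np.empty(n)                     # out-of-fold predictions m^c_k(L)
    X = np.column_stack([np.ones(n), L])    # intercept + covariate design
    for I_k in folds:
        I_kc = np.setdiff1d(idx, I_k)       # complement indices I_k^c
        # Step 2: fit the nuisance estimates on I_k^c ...
        bg, *_ = np.linalg.lstsq(X[I_kc], Y[I_kc], rcond=None)
        bm, *_ = np.linalg.lstsq(X[I_kc], A[I_kc], rcond=None)
        # ... and evaluate them on the held-out fold I_k.
        g_hat[I_k] = X[I_k] @ bg
        m_hat[I_k] = X[I_k] @ bm
    return g_hat, m_hat
```

Each observation's nuisance predictions come from a model fit without that observation's fold, which is the property the sample-splitting steps are designed to guarantee.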
Open Source Code No The paper mentions using the 'hdm package in R', but the authors do not provide a statement or link to source code for the methodology described in this paper. The text states: 'we implemented using the hdm package in R (Chernozhukov et al., 2016).'
Open Datasets No The paper describes how data was generated for simulation studies, rather than utilizing pre-existing public datasets. For Experiment 1, it states: 'The first covariate L1 was generated from a U(−2, 2) distribution, whilst the second covariate L2 and exposure were both binary with respective expectations 0.5 and expit{−L1 + 2L1L2}. The outcome Y was simulated from a N(−L1 + 2L1L2, 1) distribution.'
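The quoted data-generating process for Experiment 1 can be sketched as below. Note the minus signs in −L1 + 2·L1·L2 are reconstructed from garbled extracted text (the original minus glyphs were dropped) and should be checked against the paper before relying on them.

```python
import numpy as np

def expit(x):
    """Inverse-logit (logistic) function."""
    return 1.0 / (1.0 + np.exp(-x))

def simulate_experiment1(n, seed=0):
    """Sketch of the Experiment 1 data-generating process as quoted above.
    Signs in the linear predictor are reconstructed assumptions."""
    rng = np.random.default_rng(seed)
    L1 = rng.uniform(-2.0, 2.0, size=n)             # L1 ~ U(-2, 2)
    L2 = rng.binomial(1, 0.5, size=n)               # binary, E[L2] = 0.5
    A = rng.binomial(1, expit(-L1 + 2 * L1 * L2))   # binary exposure
    Y = rng.normal(-L1 + 2 * L1 * L2, 1.0)          # outcome, unit variance
    return L1, L2, A, Y
```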
Dataset Splits Yes Five-fold cross-fitting was used in the construction of each of the tests. 1. Divide the sample into disjoint parts I_k, each of size n_k = n/K, where K is a fixed integer (and assuming n is a multiple of K). For each I_k, let I_k^c denote all indices that are not in I_k.
Hardware Specification No The paper does not provide specific details about the hardware used to run the experiments, such as GPU/CPU models or memory specifications.
Software Dependencies No The paper mentions the use of 'the hdm package in R (Chernozhukov et al., 2016)' but does not specify version numbers for R or the hdm package itself, which is required for reproducibility.
Experiment Setup Yes The parameter that was consistently estimated was obtained using the Super Learner (van der Laan et al., 2007), whilst the inconsistently estimated parameter was obtained via ℓ1 penalised maximum likelihood with an omitted interaction term. Tuning parameters were selected using cross-validation... The parameters ζγ and ζβ were first both fixed at 0.82; we then considered a more challenging setting by lowering to ζβ = 0.2... We also reversed this, setting ζβ = 0.82 and ζγ = 0.2.
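The deliberately misspecified nuisance fit described above (ℓ1-penalised maximum likelihood with an omitted interaction term, tuning parameters chosen by cross-validation) could be sketched as follows. `LassoCV` is an illustrative stand-in for the paper's implementation (which used the hdm package in R), and `misspecified_lasso` is a hypothetical helper name, not from the paper.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def misspecified_lasso(L1, L2, Y):
    """Sketch of the 'inconsistently estimated' nuisance fit: l1-penalised
    regression whose design deliberately omits the L1*L2 interaction term,
    with the penalty selected by cross-validation as in the quoted setup."""
    X = np.column_stack([L1, L2])   # main effects only: interaction omitted
    return LassoCV(cv=5).fit(X, Y)
```

Because the true outcome model depends on the L1·L2 interaction, this fit is inconsistent by construction, which is the scenario the experiment uses to probe double robustness.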